kenlm

Language model inference code by Kenneth Heafield (kenlm at kheafield.com)

The website https://kheafield.com/code/kenlm/ has more documentation. If you're a decoder developer, please download the latest version from there instead of copying from another decoder.

Compiling

Use cmake; see BUILDING for build dependencies and more detail.

mkdir -p build
cd build
cmake ..
make -j 4

Compiling with your own build system

If you want to compile with your own build system (a Makefile, etc.) or use KenLM as a library, there are a number of macros you can set on the g++ command line or in util/have.hh.

  • KENLM_MAX_ORDER is the maximum order that can be loaded. This is done to make state an efficient POD rather than a vector.
  • HAVE_ICU If your code links against ICU, define this to disable the internal StringPiece and replace it with ICU's copy of StringPiece, avoiding naming conflicts.

ARPA files can be read in compressed format with these options:

  • HAVE_ZLIB Supports gzip. Link with -lz.
  • HAVE_BZLIB Supports bzip2. Link with -lbz2.
  • HAVE_XZLIB Supports xz. Link with -llzma.

Note that these macros impact only read_compressed.cc and read_compressed_test.cc. The bjam build system will auto-detect bzip2 and xz support.
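As a sketch, a standalone compile line using these macros might look like the following; the source list, order value, and output name are illustrative, not a tested invocation:

```
# Illustrative only: adjust paths and KENLM_MAX_ORDER for your tree.
g++ -O3 -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -I. \
    my_program.cc lm/*.cc util/*.cc util/double-conversion/*.cc \
    -lz -o my_program
```

Note that defining HAVE_ZLIB without linking -lz will produce undefined-reference errors, so the macro and the linker flag travel together.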

Estimation

lmplz estimates unpruned language models with modified Kneser-Ney smoothing. After compiling, run

bin/lmplz -o 5 <text >text.arpa

The algorithm is on-disk, using an amount of memory that you specify. See https://kheafield.com/code/kenlm/estimation/ for more.
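The memory budget and temporary-file location can be set on the command line, for example (the 80% figure is arbitrary):

```
bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa
```

Here -S caps memory use (a percentage of physical RAM or an absolute size) and -T chooses where the on-disk sort writes its temporary files.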

MT Marathon 2012 team members Ivan Pouzyrevsky and Mohammed Mediani contributed to the computation design and early implementation. Jon Clark contributed to the design, clarified points about smoothing, and added logging.

Filtering

filter takes an ARPA or count file and removes entries that will never be queried. The filter criterion can be corpus-level vocabulary, sentence-level vocabulary, or sentence-level phrases. Run

bin/filter

and see https://kheafield.com/code/kenlm/filter/ for more documentation.

Querying

Two data structures are supported: probing and trie. Probing is a probing hash table with keys that are 64-bit hashes of n-grams and floats as values. Trie is a fairly standard trie but with bit-level packing so it uses the minimum number of bits to store word indices and pointers. The trie node entries are sorted by word index. Probing is the fastest and uses the most memory. Trie uses the least memory and is a bit slower.

As is the custom in language modeling, all probabilities are log base 10.
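Since scores are log base 10, converting one into a raw probability or a per-token perplexity takes a line of standard-library Python; the score value below is invented for illustration:

```python
import math

# KenLM reports log10 probabilities.  Suppose model.score(...) returned
# -12.3 for a 5-word sentence scored with bos=True, eos=True
# (6 scored tokens including </s>); -12.3 is a made-up value.
log10_prob = -12.3
prob = 10.0 ** log10_prob                      # back to a raw probability
n_tokens = 6                                   # 5 words + </s>
perplexity = 10.0 ** (-log10_prob / n_tokens)  # per-token perplexity
ln_prob = log10_prob * math.log(10)            # natural-log scale, if a tool expects it
```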

With trie, resident memory is 58% of IRST's smallest version and 21% of SRI's compact version. Simultaneously, trie's CPU use is 81% of IRST's fastest version and 84% of SRI's fast version. KenLM's probing hash table implementation goes even faster at the expense of using more memory. See https://kheafield.com/code/kenlm/benchmark/.

Binary format via mmap is supported. Run ./build_binary to make one, then pass the binary file name to the appropriate Model constructor.
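For example, to build a trie-format binary and query it (file names here are placeholders):

```
bin/build_binary trie text.arpa text.binary
echo "this is a sentence ." | bin/query text.binary
```

Omitting the data-structure argument selects the default probing structure.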

Platforms

murmur_hash.cc and bit_packing.hh perform unaligned reads and writes that make the code architecture-dependent.
It has been successfully tested on x86_64, x86, and PPC64.
ARM support is reportedly working, at least on the iPhone.

Runs on Linux, OS X, Cygwin, and MinGW.

Hideo Okuma and Tomoyuki Yoshimura from NICT contributed ports to ARM and MinGW.

Decoder developers

  • I recommend copying the code and distributing it with your decoder. However, please send improvements upstream.

  • It's possible to compile the query-only code without Boost, but useful things like estimating models require Boost.

  • Select the macros you want, listed in the previous section.

  • There are two build systems: compile.sh and cmake. They're pretty simple and are intended to be reimplemented in your build system.

  • Use either the interface in lm/model.hh or lm/virtual_interface.hh. Interface documentation is in comments of lm/virtual_interface.hh and lm/model.hh.

  • There are several possible data structures in model.hh. Use RecognizeBinary in binary_format.hh to determine which one a user has provided. You probably already implement feature functions as an abstract virtual base class with several children. I suggest you co-opt this existing virtual dispatch by templatizing the language model feature implementation on the KenLM model identified by RecognizeBinary. This is the strategy used in Moses and cdec.

  • See lm/config.hh for run-time tuning options.

Contributors

Contributions to KenLM are welcome. Please base your contributions on https://github.com/kpu/kenlm and send pull requests (or I might give you commit access). Downstream copies in Moses and cdec are maintained by overwriting them, so do not make changes there.

Python module

Contributed by Victor Chahuneau.

Installation

pip install https://github.com/kpu/kenlm/archive/master.zip

When installing via pip, the MAX_ORDER environment variable controls the maximum order with which KenLM is built.
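For example, assuming the variable is read at build time, an order-7 build would look like:

```
MAX_ORDER=7 pip install https://github.com/kpu/kenlm/archive/master.zip
```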

Basic Usage

import kenlm
model = kenlm.Model('lm/test.arpa')
print(model.score('this is a sentence .', bos=True, eos=True))

See python/example.py and python/kenlm.pyx for more, including stateful APIs.

Building kenlm - Using vcpkg

You can download and install kenlm using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install kenlm

The kenlm port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.


The name was Hieu Hoang's idea, not mine.


kenlm's Issues

Does not compile due to a linking error with Boost

Part of the output copied below. According to the home page, "Estimation and filtering require Boost at least 1.36.0 and zlib." I have boost 1.46.1 and get the following linking error.

So basically my question is: what version of boost works?

...failed gcc.link util/bin/gcc-4.6/release/link-static/threading-multi/bit_packing_test...
...skipped <putil/bin/gcc-4.6/release/link-static/threading-multi>bit_packing_test.passed for lack of <putil/bin/gcc-4.6/release/link-static/threading-multi>bit_packing_test...
gcc.link util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test
util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test.o: In function `main':
joint_sort_test.cc:(.text.startup+0xb): undefined reference to `boost::unit_test::unit_test_main(bool (*)(), int, char**)'
collect2: ld returned 1 exit status

    "g++"    -o "util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test" -Wl,--start-group "util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test.o" "util/bin/gcc-4.6/release/link-static/threading-multi/parallel_read.o" "util/bin/gcc-4.6/release/link-static/threading-multi/read_compressed.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/diy-fp.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/fixed-dtoa.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/bignum.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/strtod.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/double-conversion.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/bignum-dtoa.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/fast-dtoa.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/cached-powers.o" "util/bin/gcc-4.6/release/link-static/threading-multi/bit_packing.o" "util/bin/gcc-4.6/release/link-static/threading-multi/ersatz_progress.o" "util/bin/gcc-4.6/release/link-static/threading-multi/exception.o" "util/bin/gcc-4.6/release/link-static/threading-multi/file.o" "util/bin/gcc-4.6/release/link-static/threading-multi/file_piece.o" "util/bin/gcc-4.6/release/link-static/threading-multi/mmap.o" "util/bin/gcc-4.6/release/link-static/threading-multi/murmur_hash.o" "util/bin/gcc-4.6/release/link-static/threading-multi/pool.o" "util/bin/gcc-4.6/release/link-static/threading-multi/scoped.o" "util/bin/gcc-4.6/release/link-static/threading-multi/string_piece.o" "util/bin/gcc-4.6/release/link-static/threading-multi/usage.o"  -Wl,-Bstatic -lboost_system-mt -lboost_system-mt -lboost_unit_test_framework-mt -lboost_thread-mt -lz -Wl,-Bdynamic -lSegFault -lrt -Wl,--end-group -pthread 


...failed gcc.link util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test...
...skipped <putil/bin/gcc-4.6/release/link-static/threading-multi>joint_sort_test.passed for lack of <putil/bin/gcc-4.6/release/link-static/threading-multi>joint_sort_test...
...failed updating 12 targets...
...skipped 16 targets...

Possible Cause of BadDiscountException?

.../libs/kenlm/lm/builder/adjust_counts.cc:61 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 1-gram discount out of range for adjusted count 2: -1.6000001
Aborted (core dumped)

What could have happened to cause this error? We preprocessed the files to limit the vocabulary to 10k (replacing out-of-vocabulary words with an unknown-word token). The files are sufficiently big (with line breaks, thanks to the help in the other thread); some output info:

Unigram tokens 77187240 types 10002
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:120024 2:37614993408 3:70528114688
ERROR: 1-gram discount out of range for adjusted count 2: -1.75
Aborted (core dumped)

Did fixing the vocabulary externally cause this problem?

Missing documentation for base of log probabilities

Hello. Firstly thanks for this great tool. The Python support has made this very easy to use alongside nltk for some recent research.

I'm having difficulty finding documentation for the probabilities returned by model.full_scores(). They appear to be log probabilities, but I'm unsure which base.

Scanning through the repository, I found this line that seems to indicate that it is base 10:

base_instance_->set_log_base(10.0);

But I can't find any other reason to confirm that this is the case. Thanks.

Process aborted during ngram estimation

This is the full log:

➜  bin/lmplz -o 5 -S 50% -T /tmp <~/data/enwiki-latest-pages-articles >text.arpa 

=== 1/5 Counting and sorting n-grams ===
Reading /home/deeppixel/data/enwiki-latest-pages-articles
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 4027024634 types 8571832
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:102861984 2:1237697920 3:2320683776 4:3713093888 5:5414928896
Statistics:
1 8571831 D1=0.682343 D2=1.02373 D3+=1.37025
2 208792530 D1=0.747714 D2=1.07416 D3+=1.35152
3 871078563 D1=0.826502 D2=1.1214 D3+=1.3274
4 1692737525 D1=0.88864 D2=1.18124 D3+=1.33282
5 2308548475 D1=0.874941 D2=1.29421 D3+=1.3912
Memory estimate for binary LM:
type     GB
probing 100 assuming -p 1.5
probing 116 assuming -r models -p 1.5
trie     53 without quantization
trie     31 assuming -q 8 -b 8 quantization 
trie     46 assuming -a 22 array pointer compression
trie     24 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:102861972 2:1329358848 3:2492548096 4:3988076544 5:5815945216
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:102861972 2:910337664 3:1706883200 4:2731013120 5:3982727424
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
**********************************************************************---Last input should have been poison.
[1]    6802 abort      bin/lmplz -o 5 -S 50% -T /tmp < ~/data/enwiki-latest-pages-articles > 

installing KenLM

Running on OSX, boost 1.55
I'm essentially following the instructions in this document:
http://victor.chahuneau.fr/notes/2012/07/03/kenlm.html

Interestingly, everything worked once, but then stopped working. When I run ./bjam, I get a couple of errors, one involving a broken pipe, but the one I'm most concerned about is:

-bash: ./kenlm/bin/lmplz: No such file or directory

The output begins with: warning: No toolsets are configured.
warning: Configuring default toolset "darwin".
warning: If the default is wrong, your build may not work correctly.
warning: Use the "toolset=xxxxx" option to override our guess.
warning: For more configuration options, please consult
warning: http://boost.org/boost-build2/doc/html/bbv2/advanced/configuration.html
...patience...
...found 628 targets...
...updating 38 targets...

and at the end, the output is...

...failed darwin.link lm/bin/left_test.test/darwin-5.1.0/release/threading-multi/left_test...
...skipped <plm/bin/left_test.test/darwin-5.1.0/release/threading-multi>left_test.run for lack of <plm/bin/left_test.test/darwin-5.1.0/release/threading-multi>left_test...
...failed updating 23 targets...
...skipped 15 targets...

If I could attach the log I would but it's very long!

Feature: Build KenLM model from n-gram counts file

Hi 👋 ,

It would be nice to train language models from existing files of n-gram counts, similar to the -read parameter of SRILM's ngram-count. The ability to load counts directly enables the use of essentially unlimited n-gram statistics, such as skip-ngrams.

Issue compiling on OS X El Capitan

Compiling kenlm on OS X El Capitan with ./bjam yields the following output – any suggestions?

I have also installed Boost via Homebrew.

Using 'darwin' toolset.

rm -rf bootstrap
mkdir bootstrap
cc -o bootstrap/jam0 command.c compile.c constants.c debug.c execcmd.c frames.c function.c glob.c hash.c hdrmacro.c headers.c jam.c jambase.c jamgram.c lists.c make.c make1.c object.c option.c output.c parse.c pathsys.c regexp.c rules.c scan.c search.c subst.c timestamp.c variable.c modules.c strings.c filesys.c builtins.c class.c cwd.c native.c md5.c w32_getreg.c modules/set.c modules/path.c modules/regex.c modules/property-set.c modules/sequence.c modules/order.c execunix.c fileunix.c pathunix.c
make.c:296:37: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( "make\t--\t%s%s\n", spaces( depth ), object_str( t->name ) );
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'

#define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~

make.c:296:37: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'

#define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                                       ^

[The same -Wstring-plus-int warning, with identical macro-expansion notes, repeats for make.c lines 303, 376, 384, 389, and 731.]

6 warnings generated.
modules/path.c:16:12: warning: implicit declaration of function 'file_query' is invalid in C99
[-Wimplicit-function-declaration]
return file_query( list_front( lol_get( frame->args, 0 ) ) ) ?
^
1 warning generated.
./bootstrap/jam0 -f build.jam --toolset=darwin --toolset-root= clean
...found 1 target...
...updating 1 target...
...updated 1 target...
./bootstrap/jam0 -f build.jam --toolset=darwin --toolset-root=
...found 139 targets...
...updating 3 targets...
[MKDIR] bin.macosxx86_64
[COMPILE] bin.macosxx86_64/b2
clang: warning: optimization flag '-finline-functions' is not supported
[the same warning repeats for each compiled file]
clang: warning: argument unused during compilation: '-finline-functions'
[The make.c -Wstring-plus-int warnings shown above repeat verbatim during the b2 build.]

make.c:768:43: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
[The same warning and macro-expansion notes repeat for make.c lines 772, 778, 784, 787, 790, 793, 797, 800, 803, 806, 809, and 812.]

make.c:815:47: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( " %s : Can make it\n", spaces( depth ) );
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'

define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~

make.c:815:47: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'

define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                                       ^

make.c:821:34: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( " %s : ", spaces( depth ) );
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'

define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~

make.c:821:34: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'

define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                                       ^

make.c:833:52: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( " %s : Depends on %s (%s)", spaces( depth ),
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'

define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~

make.c:833:52: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'

define spaces(x) ( " " + ( x > 20 ? 0 : 20-x ) )

                                       ^

22 warnings generated.
modules/path.c:16:12: warning: implicit declaration of function 'file_query' is invalid in C99 [-Wimplicit-function-declaration]
return file_query( list_front( lol_get( frame->args, 0 ) ) ) ?
^
1 warning generated.
[COPY] bin.macosxx86_64/bjam
...updated 3 targets...
~/Downloads/kenlm
Failed to run bash -c "g++ -dM -x c++ -E /dev/null -include boost/version.hpp 2>/dev/null |grep '#define BOOST_'"
Boost does not seem to be installed or g++ is confused.

clang error pip install inside virtualenv

I can install the kenlm Python package outside of a virtualenv, but I'm having trouble inside one.

Using Mac OS 10.11.4

nlp $ uname -a
Darwin Motokis-Macintosh.local 15.4.0 Darwin Kernel Version 15.4.0: Fri Feb 26 21:17:08 PST 2016; root:xnu-3248.40.184~2/RELEASE_X86_64 x86_64
nlp $ which clang
/usr/bin/clang
nlp $ clang --version
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Error message:

(nlp) nlp $ STATIC_DEPS=true pip install https://github.com/kpu/kenlm/archive/master.zip
Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip (513kB)
    100% |████████████████████████████████| 522kB 636kB/s 
Installing collected packages: kenlm
  Running setup.py install for kenlm ... error
    Complete output from command /Users/apewu/smartannotations/nlp/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-obzcbl-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-rSkucL-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/apewu/smartannotations/nlp/bin/../include/site/python2.7/kenlm:
    running install
    running build
    running build_ext
    building 'kenlm' extension
    creating build
    creating build/temp.macosx-10.11-x86_64-2.7
    creating build/temp.macosx-10.11-x86_64-2.7/util
    creating build/temp.macosx-10.11-x86_64-2.7/lm
    creating build/temp.macosx-10.11-x86_64-2.7/util/double-conversion
    creating build/temp.macosx-10.11-x86_64-2.7/python
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/bit_packing.cc -o build/temp.macosx-10.11-x86_64-2.7/util/bit_packing.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/ersatz_progress.cc -o build/temp.macosx-10.11-x86_64-2.7/util/ersatz_progress.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/exception.cc -o build/temp.macosx-10.11-x86_64-2.7/util/exception.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/file.cc -o build/temp.macosx-10.11-x86_64-2.7/util/file.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/file_piece.cc -o build/temp.macosx-10.11-x86_64-2.7/util/file_piece.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    util/file_piece.cc:37:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^
    In file included from util/file_piece.cc:3:
    In file included from ./util/double-conversion/double-conversion.h:31:
    ./util/double-conversion/utils.h:302:16: warning: unused typedef 'VerifySizesAreEqual' [-Wunused-local-typedef]
      typedef char VerifySizesAreEqual[sizeof(Dest) == sizeof(Source) ? 1 : -1]
                   ^
    2 warnings generated.
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/float_to_string.cc -o build/temp.macosx-10.11-x86_64-2.7/util/float_to_string.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    In file included from util/float_to_string.cc:3:
    In file included from ./util/double-conversion/double-conversion.h:31:
    ./util/double-conversion/utils.h:302:16: warning: unused typedef 'VerifySizesAreEqual' [-Wunused-local-typedef]
      typedef char VerifySizesAreEqual[sizeof(Dest) == sizeof(Source) ? 1 : -1]
                   ^
    1 warning generated.
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/integer_to_string.cc -o build/temp.macosx-10.11-x86_64-2.7/util/integer_to_string.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/mmap.cc -o build/temp.macosx-10.11-x86_64-2.7/util/mmap.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    util/mmap.cc:246:15: warning: unused variable 'from_size' [-Wunused-variable]
      std::size_t from_size = mem.size();
                  ^
    1 warning generated.
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/murmur_hash.cc -o build/temp.macosx-10.11-x86_64-2.7/util/murmur_hash.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/parallel_read.cc -o build/temp.macosx-10.11-x86_64-2.7/util/parallel_read.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/pool.cc -o build/temp.macosx-10.11-x86_64-2.7/util/pool.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/read_compressed.cc -o build/temp.macosx-10.11-x86_64-2.7/util/read_compressed.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    util/read_compressed.cc:24:10: fatal error: 'lzma.h' file not found
    #include <lzma.h>
             ^
    1 error generated.
    error: command 'clang' failed with exit status 1

    ----------------------------------------
Command "/Users/apewu/smartannotations/nlp/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-obzcbl-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-rSkucL-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/apewu/smartannotations/nlp/bin/../include/site/python2.7/kenlm" failed with error code 1 in /var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-obzcbl-build/

libkenlm.so ends with free(): invalid pointer

Hi,

I've tried to use kenlm as a library in my decoder. However, libkenlm.so gives unexpected results.

You can reproduce my situation as follows. Assume kenlm is compiled.

cd </path/to/kenlm/lm>
g++ -DKENLM_MAX_ORDER=2 -I../ -c -o query_main.o query_main.cc
g++ -L../lib -o query_main query_main.o -lkenlm
export LD_LIBRARY_PATH=../lib
./query_main test.arpa

This raises a core dump.

My environment is Ubuntu 12.04.2, with g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3.

"bjam install" produces "unable to find file error"

Not sure if this is an issue, but right now I am not able to install the git clone on either Windows (under Cygwin) or Linux machines. Both give me an error when running "./bjam install":

warning: mismatched versions of Boost.Build engine and core
warning: Boost.Build engine (bjam) is 2014.03.00
warning: Boost.Build core (at /usr/share/boost-build) is 2013.05-svn
error: Unable to find file or target named
error: 'prefix-include'
error: referred to from project at
error: '.'

Parsing n-grams as they appear

Hello, in the kenlm documentation I found only one function to use: en_model.score(sentence).
Can you please provide a detailed description of the available functions, if there are any?
I'm trying to read unigram, bigram, and trigram probabilities from the LM exactly as they appear there.
For example, the LM contains the following lines. I need a function that works like this: en_model.bigram_prob("too recognize") would return -4.923469.
-4.923469 too recognised
-4.923469 too recognises
-4.923469 too recognize
-4.923469 too recommend
The same for unigrams and trigrams.

Does kenlm support such functionality?

Thank you,
Zaven.
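In the meantime, a conditional log-probability can be recovered from the chain rule: log10 P(word | context) = log10 P(context word) - log10 P(context). A minimal sketch, with the scoring function passed in as a parameter so it can be backed by the kenlm Python module's model.score(s, bos=False, eos=False):

```python
def cond_log10_prob(score, context, word):
    # log10 P(word | context) = log10 P(context + word) - log10 P(context),
    # where score(s) returns the total log10 probability of string s.
    return score(context + ' ' + word) - score(context)

# Hypothetical usage with the kenlm Python module:
#   import kenlm
#   model = kenlm.Model('en_model.arpa')
#   score = lambda s: model.score(s, bos=False, eos=False)
#   cond_log10_prob(score, 'too', 'recognize')
```

Note that, because of backoff, the value returned for an n-gram not listed in the ARPA file is computed rather than looked up, so it will not always match a line in the file.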

Segmentation faults with a small corpus

Hi,

I can't get KenLM working on my corpus.

I've followed the usual steps:
./bin/lmplz -T /tmp/ --text corpus.txt --arpa myarpa.arpa
./bin/build_binary myarpa.arpa my_probing_model.mmap

Then I tried the snippet from here:
https://kheafield.com/code/kenlm/developers/

With a TrieModel, it always ends with a segfault, regardless of MAX_ORDER. The error occurs here:

lm::ngram::trie::TrieSearch<lm::ngram::DontQuantize, lm::ngram::trie::DontBhiksha>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) ()

With a ProbingModel, I get a segfault only for MAX_ORDER < 5:

lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::ResumeScore(unsigned int const*, unsigned int const*, unsigned char, unsigned long&, float*, unsigned char&, lm::FullScoreReturn&)

For MAX_ORDER = 5, the C++ program runs only with a couple of Valgrind errors:

==3445== Invalid write of size 8
==3445==    at 0x411B1A: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x409920: lm::ngram::ProbingModel::ProbingModel(char const*, lm::ngram::Config const&) (model.hh:136)

Invalid write of size 8
==3445==    at 0x43A06B: lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411515: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::SetupMemory(void*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411FC0: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)

But a JNA wrapper around the same snippet raises a "malloc(): memory corruption" when loading the model.

I tried with and without pruning, with order 2 and 3, both with the KenLM from the download section and the one from GitHub. The corpus is about 1 GB.
One peculiarity of the vocabulary is that it contains a lot of words that are substrings of other words in the vocabulary.

I'm aware that this is probably not enough information for proper debugging, but I would be interested to know whether the Valgrind errors are OK and whether you can suggest anything to help me find the problem.

My system is Mint 17. The compilation succeeded with no warnings.

Compute probabilities of all n-grams

Hi Kenneth!
I am now using kenlm to experiment with different language models. From time to time I need to compute the conditional probabilities of all n-grams. ARPA files do not list them all; there is a rule for computing the probabilities that are not explicitly listed. I wrote a simple 20-line Python script that uses the arpa package to do this. Basically, that package accepts an n-gram string and returns the probability of the last word conditioned on the prefix. Maybe I did something wrong, but it takes "forever" to compute, for instance, all 5-gram probabilities, even with hundreds of threads.
I am wondering what the best way to compute the probabilities of all possible n-grams with kenlm would be. I looked through the code and your examples, and I think something like this may work:

  1. Convert arpa to binary (probing?).
  2. Load that model.
  3. Write OpenMP parallel loop and use FullScoreForgotState to get probabilities.

Does this sound reasonable, or is there a better way to do it?
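The third step can be sketched in Python rather than OpenMP; here the `score` callable is an assumed input (for instance a wrapper around FullScoreForgotState, or the Python module's score method):

```python
from concurrent.futures import ThreadPoolExecutor

def score_all(ngrams, score):
    """Score every n-gram in parallel.

    score: callable mapping an n-gram string to its conditional
    log10 probability; here it is an assumed input.
    """
    with ThreadPoolExecutor() as pool:
        return dict(zip(ngrams, pool.map(score, ngrams)))
```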

Thanks,
Sergey.

Perplexity evaluated on full documents or full sentence?

Sorry about this question :(

I ran into some confusion: I always thought the perplexity of a document is evaluated per sentence, and then you average all the sentences' perplexities over the document. Is this how KenLM implements bin/query?

Or does KenLM evaluate the perplexity over the whole document and then normalize it by the length of the document?
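For reference, the usual corpus-level definition normalizes the total log probability by the total token count rather than averaging per-sentence perplexities; a sketch (not necessarily exactly what bin/query reports):

```python
def corpus_perplexity(sentence_log10_probs):
    # sentence_log10_probs: one list of per-token log10 probabilities
    # per sentence. Perplexity = 10 ** (-total_log10 / token_count).
    tokens = sum(len(s) for s in sentence_log10_probs)
    total = sum(sum(s) for s in sentence_log10_probs)
    return 10 ** (-total / tokens)
```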

Build systems

Can you comment on build systems?

bjam is the default and preferred. I see you have provided compile_query_only.sh, presumably as a convenience for folks who don't want to bother with Boost.

What about cmake? I see that it has been added to the repository, and in fact I just built Joshua using it, but it's not clear to me that this was the right thing to do. In particular, cmake does not seem to respond to environment settings such as KENLM_MAX_ORDER. Why is cmake present, what is its intended use, and why is it included? It also litters files all over the place.

It seems I should revert to using bjam in my own build process.

(My goal is to make it easier to depend on KenLM. Ideally I'd like to package it as a submodule. I've already separated KenLM from Joshua's wrappers and it works well, apart from the build-system complication.)

(Caveat: I do not understand modern build systems.)

building python library

Hi, thank you for this nice tool and also thanks for providing a windows version.

I have to work on a server running Windows Server 8 R2, and I successfully built KenLM itself with the project files in the windows folder. However, it always errors out when I try to install the Python library on this Windows server.

P.S. I mainly work in Python, so the KenLM training tool is not that urgent for me since I can train the data on another machine; I just want to know how to install the Python part.

Any help would be appreciated.

Update python module to accommodate renaming of base functions

Installing using pip no longer works since the changes made in 500406a

Pip install fails with the following errors:

python/kenlm.cpp:1430:59: error: ‘class lm::base::Model’ has no member named ‘Score’
python/kenlm.cpp:1450:57: error: ‘class lm::base::Model’ has no member named ‘Score’

python/kenlm.cpp:1637:74: error: ‘class lm::base::Model’ has no member named ‘FullScore’
python/kenlm.cpp:1693:72: error: ‘class lm::base::Model’ has no member named ‘FullScore’

'kenlm.Model' object has no attribute 'score'

I've just tried using KenLM, and hit an error.

>>> model = kenlm.Model('LM/en.europarl-nc.lm')
Loading the LM will be faster if you build a binary file.
Reading /Users/bittlingmayer/Desktop/sgnln2/private-SignalN-Research/tsiran/lm/LM/en.europarl-nc.lm
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*The ARPA file is missing <unk>.  Substituting log10 probability -100.
***************************************************************************************************
>>> model.score('This is a test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'kenlm.Model' object has no attribute 'score' 
>>> model
<Model from en.europarl-nc.lm>
>>> dir(model)
['BaseFullScore', 'BaseScore', 'BeginSentenceWrite', 'NullContextWrite', '__class__', '__contains__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'order', 'path']

Any idea what I may be doing wrong?

If it makes a difference, I installed via pip and I'm using python 2.7 (anaconda).

Using KenLM with ngram order greater than 6

I rebuilt my kenlm with max order set to 10 by passing
cmake .. -DKENLM_MAX_ORDER=10
during the build, and by updating the setup.py file:
ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=10'].

Now, I'm able to use lmplz without an error to build a 7 gram Language model.

However, when trying to use the python interface, I still get the following error:

IOError: Cannot read model '../models/LM_7gram.klm' (lm/model.cc:49 in void lm::ngram::detail::(anonymous namespace)::CheckCounts(const std::vector<uint64_t> &) threw FormatLoadException because counts.size() > 6'. This model has order 7 but KenLM was compiled to support up to 6. If your build system supports changing KENLM_MAX_ORDER, change it there and recompile. In the KenLM tarball or Moses, use e.g. bjam --max-kenlm-order=6 -a'. Otherwise, edit lm/max_order.hh.)

compile error

platform : 64bit, Red Hat Enterprise Linux Server release 5.8 (Tikanga)
g++ : g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52)

You must use ./bjam if you want language model estimation, filtering, or support for compressed files (.gz, .bz2, .xz)
Compiling with g++ -I. -O3 -DNDEBUG -DKENLM_MAX_ORDER=6
./util/scoped.hh: In static member function 'static void util::scoped_c_forward<T, clean>::Close(T*) [with T = void, void (* clean)(T*) = free]':
./util/scoped.hh:28:   instantiated from 'util::scoped_base<T, Closer>::~scoped_base() [with T = void, Closer = util::scoped_c_forward<void, free>]'
./util/scoped.hh:55:   instantiated from here
./util/scoped.hh:70: internal compiler error: in build_call, at cp/call.c:321
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccFSXFye.out file, please attach this to your bugreport.

equivalent to hidden-ngram

Hi Ken,

Is there a way to replicate with KenLM the workflow where we build an LM as with continuous-ngram-count and then query/process a text with hidden-ngram (given a hidden-vocab file)?

Cheers,
Vince

PyPI Package

Would it be possible to upload the official repo to PyPI?

words distribution about language model

Here I have a question about KenLM; I want to use the following function:
assume I have a trained 3-gram language model and a two-word sequence,
say "A B".
I want to get the probabilities of all words in the vocabulary given that two-word sequence:
P(A|A B) P(B|A B) P(C|A B) P(D|A B) P(E|A B) and so on.
Does C++ or Python provide this interface? Thanks a lot.
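Neither API exposes vocabulary iteration directly, but the vocabulary can be recovered from the unigram section of the ARPA file, after which each candidate word can be scored in context (e.g. with model.score). A sketch of the first half, assuming the standard ARPA layout:

```python
def unigrams_from_arpa(path):
    # Collect the words listed in the \1-grams: section of an ARPA file.
    vocab = []
    in_unigrams = False
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line == '\\1-grams:':
                in_unigrams = True
                continue
            if in_unigrams:
                if not line.strip() or line.startswith('\\'):
                    break  # end of the unigram section
                # Each entry is: log10prob <tab> word [<tab> backoff]
                vocab.append(line.split('\t')[1])
    return vocab
```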

lack of basic functionality

Even the most basic function, P(you | where are), cannot be computed:
full_scores("where are you") automatically appends "<s>" at the beginning of the phrase, which is unhelpful.
If I really want to compute "<s> where are you", I will append "<s>" myself.

Segmentation fault

Hi, I got a segmentation fault when running "lmplz -o 3 < text > arpa" on a corpus; the stack trace is pasted below. I've had lmplz run fine on several other corpora. The only special thing about this corpus is that it contains a lot of duplicated sentences; I don't know if that could cause the segmentation fault.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffca054700 (LWP 9216)]
0x00000000004856ca in lm::builder::NGram::IsMarked (this=0x7fffca053c20) at ./lm/builder/ngram.hh:77
77 return Value().count >> (sizeof(Value().count) * 8 - 1);
(gdb) bt
#0 0x00000000004856ca in lm::builder::NGram::IsMarked (this=0x7fffca053c20) at ./lm/builder/ngram.hh:77
#1 0x000000000048e12a in lm::builder::NGram::CutoffCount (this=0x7fffca053c20) at ./lm/builder/ngram.hh:93
#2 0x000000000048afa6 in lm::builder::(anonymous namespace)::PruneNGramStream::operator++ (this=0x7fffca053c20) at /home/cfan/tools/kenlm/lm/builder/initial_probabilities.cc:74
#3 0x000000000048bb40 in lm::builder::(anonymous namespace)::MergeRight::Run (this=0x95bc78, primary=...) at /home/cfan/tools/kenlm/lm/builder/initial_probabilities.cc:238
#4 0x000000000048df48 in util::stream::Thread::operator()<util::stream::ChainPosition, lm::builder::{anonymous}::MergeRight>(const util::stream::ChainPosition &, lm::builder::(anonymous namespace)::MergeRight &) (this=0x928170, position=..., worker=...) at ./util/stream/chain.hh:77
#5 0x000000000048ddf1 in boost::_bi::list2boost::_bi::value<util::stream::ChainPosition, boost::_bi::valuelm::builder::{anonymous}::MergeRight >::operator()boost::reference_wrapper<util::stream::Thread, boost::_bi::list0>(boost::_bi::type, boost::reference_wrapperutil::stream::Thread &, boost::_bi::list0 &, int) (this=0x95bc40, f=..., a=...) at /usr/include/boost/bind/bind.hpp:313
#6 0x000000000048dccf in boost::_bi::bind_t<void, boost::reference_wrapperutil::stream::Thread, boost::_bi::list2boost::_bi::value<util::stream::ChainPosition, boost::_bi::valuelm::builder::{anonymous}::MergeRight > >::operator()(void) (this=0x95bc38) at /usr/include/boost/bind/bind_template.hpp:20
#7 0x000000000048dc34 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::reference_wrapperutil::stream::Thread, boost::_bi::list2boost::_bi::value<util::stream::ChainPosition, boost::_bi::valuelm::builder::{anonymous}::MergeRight > > >::run(void) (this=0x95bab0) at /usr/include/boost/thread/detail/thread.hpp:61

Question about model interpolation

Hello,
I know this isn't an issue, but I didn't find anywhere else to ask.
I think there is no way to interpolate multiple ARPA models into one, as SRILM and IRSTLM do. Is this feature planned, or does kenlm leave it out on purpose?

Thank you anyway for the awesome job with kenlm!

Different scores for 4-gram and 5-gram LM on sentence whose length is 4

Hi @kpu, I have a question for you.

I trained a 4-gram LM and a 5-gram LM on the same corpus with the same configuration.

When I test the language models on a sentence, I find an unreasonable result.

For example, I have a sentence here:

m4 = kenlm.Model('4gram-lm')
m5 = kenlm.Model('5gram-lm')
sent_3 = 'bolivia holds presidential'
s4 = m4.score(sent_3, bos = False, eos = False)
s5 = m5.score(sent_3, bos = False, eos = False)

Testing the language model scores on a sentence of length 3, I get exactly the same s4 and s5, which is reasonable:

s4: -13.948734283447266
s5: -13.948734283447266

But when I test on a sentence of length 4, something strange happens:

sent_4 = 'bolivia holds presidential and'
s4 = m4.score(sent_4, bos = False, eos = False)
s5 = m5.score(sent_4, bos = False, eos = False)
s4: -8.61363410949707
s5: -8.647890090942383

I think s4 and s5 should be the same; however, I get slightly different values, as shown above.
For a string of length 4, without considering bos and eos:

p4(w1 w2 w3 w4) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3)
p5(w1 w2 w3 w4) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3)

So p4 and p5 should be the same, right? Can you give me an explanation for this?

Of course, for a sentence of length 5 they will differ, because the last terms in the following formulas are different.

p4(w1 w2 w3 w4 w5) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3) * p(w5 | w2 w3 w4)
p5(w1 w2 w3 w4 w5) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3) * p(w5 | w1 w2 w3 w4)

Compression support for Python module

setup.py needs to be adjusted manually to add flags like -DHAVE_ZLIB under the extra_compile_args section in order to be able to read compressed LM files.
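A sketch of the kind of edit meant here, mirroring the macros documented for read_compressed.cc; the exact variable names in setup.py may differ between versions:

```python
# Compile-time macros enabling compressed ARPA input, plus the
# libraries each one requires at link time.
EXTRA_COMPILE_ARGS = ['-DHAVE_ZLIB', '-DHAVE_BZLIB', '-DHAVE_XZLIB']
EXTRA_LIBRARIES = ['z', 'bz2', 'lzma']
```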

Mavericks compatibility

I've run into issues trying to compile kenlm on Mavericks 10.9. Using the default clang provided by Xcode:

Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.0.0
Thread model: posix

everything seems to compile okay (a few tests fail), but when I go to train a model, I get:

jbg-hackintosh:simtrans jbg$ lmplz -o 3 -S 2G -T /tmp < scratch/lm/train-de > scratch/lm/train-de.arpa
=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Function not implemented

I thought maybe clang was the issue, so I also tried with gcc 4.8 (via homebrew), which produces a linking error (which I won't copy here, as it may be a boost issue; haven't debugged fully). My student reproduced the same issue on his Mavericks laptop.

Is there a recommended path for building kenlm on 10.9?

LM without probabilities for n-grams containing <s> </s>

Hi,

I am using the tool to build an LM over entity grids. For obvious reasons, I am therefore not interested in including probabilities of n-grams that contain the sentence boundaries. Is it possible to achieve this somehow? I still want to compute n-grams only within a sentence, so concatenating everything into one big sentence would not solve the problem.

thanks! (especially for the great tool!)

error: 'features.h' file not found when `pip install` on mac

    clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/file.cc -o build/temp.macosx-10.11-x86_64-2.7/util/file.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
    util/file.cc:32:10: fatal error: 'features.h' file not found
    #include <features.h>
             ^
    1 error generated.
    error: command 'clang' failed with exit status 1


2-gram discount out of range for adjusted count

I have two files. One file works fine with kenlm, the other gives the following error:

jbg-hackintosh:qblearn jbg$ lmplz -o 2 -S 2G -T -kndiscount /tmp < bl > scratch/Literature/10393.comb.arpa
=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Unigram tokens 2366 types 1168
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:14016 2:2147469568
/Users/jbg/repositories/kenlm/lm/builder/adjust_counts.cc:50 in void lm::builder::::StatCollector::CalculateDiscounts() threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 2-gram discount out of range for adjusted count 3: -0.402645
Abort trap: 6

The only difference between the two files is that one ends with the sentence:

Lord Melbourne offered him a lordship, which he declined

I've also sent the full files to Kenneth via e-mail.

Possibility that smaller LM has better perplexity?

Hi,

I'm running KenLM on LM1B data (Language-Modeling 1 Billion), and for some weird reason the perplexity goes down for an extremely small model:

unigram tokens | unigram types | with OOV | exclude OOV
38023755 | 337972 | 156.5 | 148.27
3807417 | 107563 | 247.5 | 215.76
380918 | 32879 | 398.2 | 283.43
37438 | 8406 | 522.2 | 253.29
3728 | 1640 | 392.2 | 118.1

As you can see, when the unigram token count drops to its lowest (the smallest model), the perplexity magically drops to 392.2.

How does KenLM calculate perplexity including OOV and excluding OOV?
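For reference, here is how perplexity is conventionally computed from per-word log10 probabilities; I believe KenLM's query tool follows the same convention, with the excluded-OOV variant dropping OOV words from both the sum and the word count. The numbers below are made up:

```python
# (log10 probability, is_oov) for each scored word -- hypothetical values.
scores = [(-1.5, False), (-2.0, True), (-1.2, False), (-1.8, False)]

def perplexity(scores, include_oov):
    # Perplexity = 10 ** (-average log10 probability per word).
    used = [lp for lp, oov in scores if include_oov or not oov]
    return 10 ** (-sum(used) / len(used))

ppl_incl = perplexity(scores, include_oov=True)   # averages over all 4 words
ppl_excl = perplexity(scores, include_oov=False)  # drops the OOV word
```

One plausible cause of the anomaly: in a tiny model, most test tokens are OOV, and the `<unk>` token absorbs a large probability mass, so OOV words get scored relatively well and the overall perplexity can look deceptively low.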

Fails to compile with Boost 1.41

Looks like the required flag from Boost.Program_options is used, which was only added in 1.42. I guess the version requirement in CMakeLists.txt should be raised.

[ 85%] Building CXX object lm/CMakeFiles/partial_test.dir/partial_test.cc.o
/home/cortex-m40/kenlm/lm/kenlm_benchmark_main.cc: In function ‘int main(int, char**)’:
/home/cortex-m40/kenlm/lm/kenlm_benchmark_main.cc:200:51: error: ‘class boost::program_options::typed_value<std::basic_string<char>, char>’ has no member named ‘required’
       ("model,m", po::value<std::string>(&model)->required(), "Model to query or convert vocab ids")
                                                   ^
make[2]: *** [lm/CMakeFiles/kenlm_benchmark.dir/kenlm_benchmark_main.cc.o] Error 1
make[1]: *** [lm/CMakeFiles/kenlm_benchmark.dir/all] Error 2

Compilation Problem

I'm getting this error while trying to compile kenlm with make -j 4; please help.

undefined reference to `boost::unit_test::ut_detail::normalize_test_case_name

cppcheck

Hello! I'm new to open source and would like to help.
I checked the kenlm project with Cppcheck, a static analysis tool for C/C++ code.
All the errors are in "jam-files/engine". Can I fix these errors via a pull request?
Or is that code not used?

[jam-files/engine/compile.c:69]: (error) Buffer is accessed out of bounds.
[jam-files/engine/hcache.c:146]: (error) Common realloc mistake: 'buf' nulled but not freed upon failure
[jam-files/engine/lists.c:104]: (error) Pointer to local array variable returned.
[jam-files/engine/lists.c:135]: (error) Pointer to local array variable returned.
[jam-files/engine/lists.c:35]: (error) Allocation with malloc, return doesnt release it.
[jam-files/engine/make1.c:121]: (error) Allocation with malloc, return doesnt release it.
[jam-files/engine/mkjambase.c:73]: (error) Resource leak: fout
[jam-files/engine/modules/order.c:85]: (error) Memory leak: colors
[jam-files/engine/object.c:262]: (error) Memory leak: m
[jam-files/engine/regexp.c:255]: (error) Memory leak: r
[jam-files/engine/regexp.c:520]: (error) Uninitialized variable: classend
[jam-files/engine/regexp.c:521]: (error) Uninitialized variable: classr
[jam-files/engine/rules.c:552]: (error) Buffer is accessed out of bounds.
[jam-files/engine/yyacc.c:166]: (error) Memory leak: key.string
[jam-files/engine/yyacc.c:195]: (error) Resource leak: grammar_source_f

full list: http://pastebin.com/0AjCPcD

'limits' issue compiling on OSX

KenLM is sweet.

In order to compile it on OSX (10.8.3) I had to modify the 'limits' include in:

https://github.com/kpu/kenlm/blob/master/util/file.cc

I added one more include to the very top of this file:

#include <limits.h>

and everything suddenly compiled like magic. The '.h' was the secret sauce.

Brew is pretty nice for the boost stuff too. I was dreading this aspect, but:

$ brew install boost

just worked.

Cannot compile lmplz_main.cc

I'm using Ubuntu 12.04.
I previously tried to compile with Boost 1.46 but failed because -lboost_exception did not exist in /usr/lib.
Then I tried with Boost 1.55 (/usr/local), but lmplz_main always failed while everything else compiled successfully. Both the source from GitHub and from http://kheafield.com/code/kenlm.tar.gz give the same error.

gcc.compile.c++ /home/***/LM/kenlm/lm/builder/bin/gcc-4.6/release/link-static/threading-multi/lmplz_main.o
/home/***/LM/kenlm/lm/builder/lmplz_main.cc: In function ‘int main(int, char**)’:
/home/***/LM/kenlm/lm/builder/lmplz_main.cc:55:72: error: no matching function for call to ‘value(uint64_t*)’
/home/***/LM/kenlm/lm/builder/lmplz_main.cc:55:72: note: candidates are:
/usr/local/include/boost/program_options/detail/value_semantic.hpp:175:5: note: template<class T> boost::program_options::typed_value<T>* boost::program_options::value()
/usr/local/include/boost/program_options/detail/value_semantic.hpp:183:5: note: template<class T> boost::program_options::typed_value<T>* boost::program_options::value(T*)

"g++"  -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -pthread  -DKENLM_MAX_ORDER=6 -DNDEBUG  -I"." -I"util/double-conversion" -c -o "/home/***/LM/kenlm/lm/builder/bin/gcc-4.6/release/link-static/threading-multi/lmplz_main.o" "/home/***/LM/kenlm/lm/builder/lmplz_main.cc"

...failed gcc.compile.c++ /home/***/LM/kenlm/lm/builder/bin/gcc-4.6/release/link-static/threading-multi/lmplz_main.o...

So, what's wrong with my compilation? I'm new to this.

Continuous ngram count

Is there a nice way to emulate SRILM's continuous-ngram-count? My goal is to have markers for punctuation (such as commas, periods, exclamation marks, etc.) and to be able to keep context across sentences.
Currently I put the whole text on one line, but that's not great memory-wise.
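One workaround sketch (my own approximation, not a kenlm or SRILM feature): split the single long line into chunks that overlap by order-1 tokens, so no n-gram of the target order is lost at a chunk boundary. Lower-order n-grams inside the overlap get counted twice, and chunk edges still see the per-line `<s>`/`</s>` markers, so this only reduces, not eliminates, boundary effects.

```python
def overlapping_chunks(tokens, chunk_size=100000, order=5):
    """Yield token chunks that overlap by order-1 tokens, so every
    n-gram of the target order still appears in some chunk.
    Lower-order n-grams in the overlap are duplicated (approximation)."""
    step = chunk_size - (order - 1)
    for start in range(0, max(1, len(tokens) - (order - 1)), step):
        yield tokens[start:start + chunk_size]

# Usage sketch: write each chunk as its own line for lmplz.
# with open('corpus.chunked', 'w') as out:
#     for chunk in overlapping_chunks(all_tokens):
#         out.write(' '.join(chunk) + '\n')
```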

internal compiler error

Hi, I ran into an issue while trying to compile mosesdecoder on RedHat 5.8 with gcc 4.1.2.
Any tip on how to fix it?
Thank you very much!

./util/scoped.hh:70: internal compiler error: in build_call, at cp/call.c:321
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccnex5v6.out file, please attach this to your bugreport.
...failed gcc.compile.c++ lm/bin/gcc-4.1.2/release/debug-symbols-on/link-static/threading-multi/quantize.o...
