The exchange from xiaoyeye1117

An efficient C++ implementation of the exchange word clustering algorithm.

The algorithm optimizes bigram perplexity by exchanging words between classes. Candidate exchanges can be evaluated in parallel in multiple threads. Word-class and class-word statistics are used for efficiency.
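To make the optimization criterion concrete, the following is a minimal Python sketch of a class bigram log-likelihood and one naive exchange pass. It is not the C++ implementation in this repository: it recomputes the full likelihood for every candidate move, uses no threads, and does not use the word-class and class-word statistics. All function names (bigram_counts, class_bigram_loglik, exchange_pass, perplexity) are hypothetical, and keeping the sentence markers in fixed classes of their own is an assumption.

```python
import math
from collections import Counter

def bigram_counts(sentences):
    """Count word bigrams over tokenized sentences (lists of words)."""
    counts = Counter()
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            counts[(prev, cur)] += 1
    return counts

def class_bigram_loglik(bigrams, word2class):
    """Log-likelihood of p(w|v) = p(c(w)|c(v)) * p(w|c(w)) with ML estimates."""
    class_bigrams = Counter()  # N(c(v), c(w))
    hist_class = Counter()     # count of each class as history
    pred_class = Counter()     # count of each class as predicted class
    pred_word = Counter()      # count of each word as predicted word
    for (v, w), n in bigrams.items():
        cv, cw = word2class[v], word2class[w]
        class_bigrams[(cv, cw)] += n
        hist_class[cv] += n
        pred_class[cw] += n
        pred_word[w] += n
    ll = 0.0
    for (cv, cw), n in class_bigrams.items():
        ll += n * math.log(n / hist_class[cv])
    for w, n in pred_word.items():
        ll += n * math.log(n / pred_class[word2class[w]])
    return ll

def exchange_pass(bigrams, word2class, candidate_classes, fixed=("<s>", "</s>")):
    """One naive pass: move each word to the candidate class with the best likelihood."""
    for word in sorted(word2class):
        if word in fixed:
            continue  # assumption: sentence markers stay in their own classes
        best_class = word2class[word]
        best_ll = class_bigram_loglik(bigrams, word2class)
        for c in candidate_classes:
            word2class[word] = c
            ll = class_bigram_loglik(bigrams, word2class)
            if ll > best_ll + 1e-12:
                best_class, best_ll = c, ll
        word2class[word] = best_class
    return class_bigram_loglik(bigrams, word2class)

def perplexity(loglik, num_predicted_tokens):
    """Perplexity over predicted tokens (every token except the begin marker)."""
    return math.exp(-loglik / num_predicted_tokens)
```

Since a move is only accepted when it raises the log-likelihood, a pass can never increase the training perplexity. An efficient implementation would instead update only the counts affected by moving a single word.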

The corpus is assumed to contain one sentence per line. Sentence begin and end markers (<s> and </s>) are added to each line if they are not already present. Perplexity values include the sentence end symbol.
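The marker handling described above can be sketched in a few lines of Python. This is an illustration only: the function name add_sentence_markers is hypothetical, and whitespace tokenization and the treatment of empty lines (returned as an empty list) are assumptions, not documented behavior of the tool.

```python
def add_sentence_markers(line, bos="<s>", eos="</s>"):
    """Split a line on whitespace and add begin/end markers if they are missing."""
    tokens = line.split()
    if not tokens:
        return tokens  # assumption: empty lines yield no tokens
    if tokens[0] != bos:
        tokens.insert(0, bos)
    if tokens[-1] != eos:
        tokens.append(eos)
    return tokens
```

With this convention, a line that already carries both markers is left unchanged, so marked and unmarked corpora produce the same token sequences.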

For more details:
Martin, Liermann, Ney: Algorithms for bigram and trigram word clustering, Speech Communication 1998
Botros, Irie, Sundermeyer, Ney: On efficient training of word classes and their application to recurrent neural network language models, Interspeech 2015

To convert the training set, and a development set if one is used, into class sequences, the class_corpus.py script is provided.
To handle possible out-of-vocabulary words in these sets, the unk symbol must be specified.
If the unk symbol is written in capitals (for instance in VariKN), use the --cap_unk switch; for a lowercase
unk symbol (for instance in SRILM), use the --lc_unk switch.
The ngramppl, classppl and classintppl programs check for the correct unk symbol.
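The conversion idea can be illustrated with a short Python sketch. It does not reproduce class_corpus.py: the function name to_class_sequence is hypothetical, the word-to-class mapping is passed as an in-memory dictionary rather than read from the .cmemprobs.gz file, and both the unk strings ("<UNK>" and "<unk>") and the choice to emit the unk symbol itself for out-of-vocabulary words are assumptions.

```python
def to_class_sequence(line, word2class, unk="<UNK>"):
    """Replace each known word by its class label and each unknown word by unk."""
    out = []
    for word in line.split():
        if word in word2class:
            out.append(str(word2class[word]))
        else:
            out.append(unk)  # assumption: OOV words map to the unk symbol itself
    return " ".join(out)

# Hypothetical mapping, for illustration only.
example_classes = {"the": 3, "cat": 7, "sat": 12}
capitalized = to_class_sequence("the cat sat quietly", example_classes, unk="<UNK>")
lowercase = to_class_sequence("the cat sat quietly", example_classes, unk="<unk>")
```

Whichever spelling is chosen has to match what the n-gram toolkit used downstream expects, which is why the switch must be set explicitly.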

Usage example:
exchange -c 1000 -a 1000 -m 10000 -t 2 corpus.txt exchange.c1000
class_corpus.py --cap_unk exchange.c1000.cmemprobs.gz <train.txt >train.classes.txt
class_corpus.py --cap_unk exchange.c1000.cmemprobs.gz <devel.txt >devel.classes.txt
varigram_kn -3 -C -Z -a -n 5 -D 0.02 -E 0.04 -o devel.classes.txt train.classes.txt exchange.vkn.5g.arpa.gz
classppl exchange.vkn.5g.arpa.gz exchange.c1000.cmemprobs.gz eval.txt
