Git Product home page Git Product logo

exchange's Introduction

An efficient C++ implementation of the exchange word clustering algorithm

Optimizes bigram perplexity by swapping words between classes. The evaluations can be done in parallel in multiple threads. Word-class and Class-word statistics are used for efficiency.

One sentence per line is assumed. Sentence begin and end markers (<s> and </s>) are added to each line if not present in the corpus. Perplexity values include the sentence end symbol.

For more details:
Martin, Liermann, Ney: Algorithms for bigram and trigram word clustering, Speech Communication 1998
Botros, Irie, Sundermeyer, Ney: On efficient training of word classes and their application to recurrent neural network language models, Interspeech 2015

For converting the training and possible development sets to class sequences, the class_corpus.py script is provided.
To handle possible out-of-vocabulary words in these sets, the unk symbol must be specified.
If the unk symbol is written in capitals (for instance VariKN), use the --cap_unk switch and for lowercase
unk symbol (for instance SRILM) use the --lc_unk switch.
ngramppl/classppl/classintppl checks for the correct unk symbol.

Usage example:
exchange -c 1000 -a 1000 -m 10000 -t 2 corpus.txt exchange.c1000
class_corpus.py --cap_unk exchange.c1000.cmemprobs.gz <train.txt >train.classes.txt
class_corpus.py --cap_unk exchange.c1000.cmemprobs.gz <devel.txt >devel.classes.txt
varigram_kn -3 -C -Z -a -n 5 -D 0.02 -E 0.04 -o devel.classes.txt train.classes.txt exchange.vkn.5g.arpa.gz
classppl exchange.vkn.5g.arpa.gz exchange.c1000.cmemprobs.gz eval.txt

exchange's People

Contributors

shadowrock avatar senarvi avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.