
lm_build's Introduction

Kaldi Language Model Building

Adapting Your Own Language Model

Instructions to learn about building a Kaldi language model based on your own text.

When you clone this code into a Kaldi experiment like …/kaldi-trunk/egs/tedlium/s5, you get a folder lm_build/ with tools and examples showing how to adapt and train a language model from your own training text file.
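For example (a sketch; substitute this repository's clone URL for the placeholder):

cd kaldi-trunk/egs/tedlium/s5
git clone <this-repo-url> lm_build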

Adding New Vocabulary Words to the Lexicon

The new run_adapt.sh script makes LM adaptation much easier.

  • Method 1: manually create a file newwords.txt in the lm_build working folder and place in it new words that are not already in the lexicon (TEDLIUM.152k.dic). Pronunciations will be automatically generated and added to the dictionary.
  • Method 2: when you run run_adapt.sh, candidate OOV words are written to the file candidate_oovs.txt. This candidate list contains every word found in the training text that is not already in the dictionary (an OOV word) and that appears more than once. Rename this file to newwords.txt and run run_adapt.sh again to use all of these candidates. Or edit newwords.txt, consulting oov-counts.txt for the word frequency counts, to iteratively refine the dictionary; see the sketch after this list.
  • (optionally) add to the example_txt training text file some example sentences that use the new words. Hint: you may need to repeat these LM adaptation sentences 50 to 100 times for the transcriber to recognize the new words and produce them in its output.
  • Run the script run_adapt.sh. This does several things, but the end result is a new composed decoding graph TLG.fst in the output folder data/lang_phn_test/.
  • Point your Eesen Transcriber setup at the resulting graph, for example by setting this value in /vagrant/Makefile.options:

GRAPH_DIR?=$(EESEN_ROOT)/asr_egs/tedlium/v2-30ms/lm_build/data/lang_phn_test
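Putting Methods 1 and 2 together, a minimal sketch of the adaptation loop might look like this. The names run_adapt.sh, candidate_oovs.txt, oov-counts.txt, newwords.txt, and example_txt come from the steps above; the oov-counts.txt format (one "word count" pair per line), the count threshold, the repetition count, and my_new_sentences.txt are illustrative assumptions:

cd lm_build

# First pass: writes candidate_oovs.txt and oov-counts.txt
./run_adapt.sh

# Keep candidate OOV words seen more than twice
# (assumes oov-counts.txt holds "word count" per line)
awk '$2 > 2 { print $1 }' oov-counts.txt > newwords.txt

# Optionally weight the adaptation sentences by repeating them in the training text
# (my_new_sentences.txt is a hypothetical file of sentences using the new words)
for i in $(seq 1 75); do cat my_new_sentences.txt; done >> example_txt

# Second pass: generates pronunciations and rebuilds TLG.fst in data/lang_phn_test/
./run_adapt.sh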

Adding Your Own Pronunciations

This process makes use of the CMU Lexicon Tool to generate dictionary entries with phonetic pronunciations for unseen words. These generated pronunciations are not always correct. An alternative approach (call it Method 3) is to add your own words and pronunciations directly to TEDLIUM.152k.dic first, perhaps pattern-matching parts of pronunciations from similar words. A word can also have more than one pronunciation, e.g.:

zydeco Z AY D EH K OW
zydeco(2) Z IH D AH K OW
zydeco(3) Z AY D AH K OW
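If you go this route, a quick hedged sketch: look up similar words already in the dictionary to borrow pronunciation patterns from, then append your new entries (the search term "disco" is just an illustration):

# Find existing entries with similar spelling to pattern-match pronunciations from
grep -i "^disco" TEDLIUM.152k.dic

# Append the new entry and any variants
echo "zydeco Z AY D EH K OW" >> TEDLIUM.152k.dic
echo "zydeco(2) Z IH D AH K OW" >> TEDLIUM.152k.dic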

FAQ

How some of the scripts work
Deterministic (tiny) LM Building
Adding Technical Words to Dictionary
More Details About LM Building

lm_build's People

Contributors

fmetze, riebling


lm_build's Issues

Missing Script: ctc_compile_dict_token.sh in utils folder

Hi~
I noticed the call ../utils/ctc_compile_dict_token.sh in the main script run_adapt.sh (line 66), but there is no script with that name under the utils folder.

Should I copy it from somewhere else?

Please help check this ~

Best regards,

Li

Building a Language Model with a Small Amount of Data

I'm looking to build a language model from a small amount of text, and for experimental purposes I'm also trying with a small example_txt.

Earlier, I was using the instructions under the heading "Adapting your own Language Model for EESEN-tedlium" here:

cd ~/eesen/asr_egs/tedlium/v2-30ms/lm_build
./train_lms.sh example_txt local_lm
cd ..
lm_build/utils/decode_graph_newlm.sh data/lang_phn_test

When I tried with an example text of merely 145 words, it successfully built a language model, but the results were pretty bad: most of the words in the transcript came from outside example_txt. So I tried modifying wordlist.txt to include only about ~90 words, just the words I expected in the transcript.

I got an error like:
compute_perplexity: no unigram-state weight for predicted word "BA"
(I think it was something other than "BA", "BH" or something... I can find out if it's important)

I played around and realized that wordlist.txt had to contain at least about 47k words to get rid of the error. So I padded it with fake words full of symbols, and things were working (although not as well as I'd like).
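Roughly like this (a sketch from memory; the padding token and count are just what I recall using):

# Pad wordlist.txt with ~47k unique fake words
seq -f "PAD%05.0f" 1 47000 >> wordlist.txt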

Since run_adapt.sh seemed to be a better recipe, as I wrote in another discussion, I tried that next. Even keeping the original large dictionary, if I just reduced example_txt to a small piece of 145 words, it repeatedly printed the message:

"compute_perplexity: for history-state "", no total-count % is seen
(perhaps you didn't put the training n-grams through interpolate_ngrams?)"

and eventually ended with:

"Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/vagrant/eesen/tools/kaldi_lm/optimize_alpha.pl line 23.
Expecting files adapt_lm/3gram-mincount//ngrams_disc and adapt_lm/3gram-mincount//../word_map to exist
E.g. see egs/wsj/s3/local/wsj_train_lm.sh for examples.
Finding OOV words in ARPA LM but not in our words.txt
gzip: adapt_lm/3gram-mincount/lm_pr6.0.gz: No such file or directory
Composing the decoding graph using our own ARPA LM
No such file adapt_lm/3gram-mincount/lm_pr6.0.gz"

Any thoughts on why these errors are occurring? (I'm more interested in the run_adapt.sh recipe now.)

I expect that in our target application we will have much more data than this, but example_txt could still be much smaller than the original example_txt file (a specific, narrow domain). So it's worthwhile for me to understand the problem above beyond the scope of this toy experiment.
