cmusphinx / g2p-seq2seq
G2P with Tensorflow
License: Other
I run
python /home/ubuntu/g2p-seq2seq/g2p_seq2seq/g2p.py --train cmudict.dict --num_layers 4 --size 64 --model model
I get
WER : 0.964269283852
Accuracy : 0.0357307161478
If the vocabulary file is already loaded, do not re-read it.
It does not read anything anymore.
Hi,
thanks so much for this great project!
I have it running in --decode mode, but I run into this error in --interactive mode, where I receive this message:
$ sudo g2p-seq2seq --interactive --model g2p-seq2seq-cmudict
Creating 2 layers of 512 units.
Reading model parameters from g2p-seq2seq-cmudict
> hello
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/bin/g2p-seq2seq", line 11, in <module>
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/g2p_seq2seq-5.0.0a0-py3.5.egg/g2p_seq2seq/app.py", line 78, in main
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/g2p_seq2seq-5.0.0a0-py3.5.egg/g2p_seq2seq/g2p.py", line 308, in interactive
TypeError: decoding str is not supported
Sorry if this is a newbie error. Any help much appreciated :)
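For reference, a minimal sketch, assuming the failure comes from decoding a string that is already unicode under Python 3 (the traceback points at the interactive input handling), of reading a word so it works under both Python 2 and 3; the helper name is illustrative, not the project's actual code:

import sys

def read_word(prompt="> "):
    line = input(prompt) if sys.version_info[0] >= 3 else raw_input(prompt)
    # Python 3 input() already returns unicode; decoding it again raises
    # "TypeError: decoding str is not supported", so only decode raw bytes.
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    return line.strip()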
g2p-seq2seq --evaluate NEWARABIC/test.wordlist --model NEWARABIC
Creating 2 layers of 64 units.
Reading model parameters from NEWARABIC
Beginning calculation word error rate (WER) on test sample.
Words : 0
Errors: 0
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 81, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 348, in evaluate
ZeroDivisionError: float division by zero
When I decode the same wordlist, it works fine.
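The log above shows Words: 0, so the division by zero comes from an empty test sample. A minimal sketch, with hypothetical names, of guarding the WER computation so this case reports a clear message instead:

def word_error_rate(errors, words):
    # Guard against an empty test sample (the log above shows Words: 0).
    if words == 0:
        raise ValueError("Test sample is empty; check that the --evaluate "
                         "file contains word/pronunciation pairs.")
    return float(errors) / words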
Would be interesting to compare with a similar CNTK model:
https://github.com/Microsoft/CNTK/blob/master/Examples/SequenceToSequence/Miscellaneous/G2P/G2P.cntk
I am trying to do everything right but this error still persists.
Creating 2 layers of 64 units.
Created model with fresh parameters.
global step 200 learning rate 0.5000 step-time 3.09 perplexity 1.57
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 67, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 217, in train
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 253, in __run_evals
File "/usr/local/lib/python2.7/dist-packages/tensorflow/models/rnn/translate/seq2seq_model.py", line 250, in get_batch
encoder_input, decoder_input = random.choice(data[bucket_id])
File "/usr/lib/python2.7/random.py", line 275, in choice
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
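The IndexError comes from random.choice being called on an empty bucket of the validation set, i.e. no held-out word falls into that length range. A minimal sketch, with an illustrative helper name, of skipping empty buckets before sampling:

import random

def sample_pair(data, bucket_id):
    bucket = data[bucket_id]
    if not bucket:
        return None  # no evaluation data in this length bucket, skip it
    return random.choice(bucket)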
I want to test the accuracy of the trained model on cmudict.
Are there any standard training, validation, and test dictionaries for this task?
How is it compared in papers for fair evaluation if there are no standard partitions?
Thanks a lot for this code.
create_vocabulary(ph_vocab_path, train_ph)
create_vocabulary(gr_vocab_path, train_gr)
# Initialize vocabularies.
ph_vocab = initialize_vocabulary(ph_vocab_path, False)
gr_vocab = initialize_vocabulary(gr_vocab_path, False)
Why do you need to initialize the vocabulary right after you created it? The logic should be more straightforward: first build the vocabulary, then save it; then there is no need to reload it again.
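A minimal sketch of that flow, with illustrative names rather than the project's actual API: build the vocabulary in memory, write it to disk, and return it, so nothing has to be re-read right after creation:

def build_and_save_vocabulary(symbol_lists, vocab_path):
    vocab = {}
    for symbols in symbol_lists:
        for symbol in symbols:
            if symbol not in vocab:
                vocab[symbol] = len(vocab)
    # Persist one symbol per line, in id order, and hand the mapping back.
    with open(vocab_path, "w") as out:
        for symbol in sorted(vocab, key=vocab.get):
            out.write(symbol + "\n")
    return vocab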
I ran with 2 layers and 512 units but got nowhere close to the reported results.
Is this execution correct?
python -u g2p.py --train ../../cmudict/cmudict.dict --size 512
Preparing G2P data
Creating vocabularies in /tmp
Creating vocabulary /tmp/vocab.phoneme
Creating vocabulary /tmp/vocab.grapheme
Reading development and training data.
Creating 2 layers of 512 units.
Reading model parameters from /tmp/translate.ckpt-200
global step 400 learning rate 0.5000 step-time 2.78 perplexity 7.83
eval: bucket 0 perplexity 4.88
eval: bucket 1 perplexity 6.30
eval: bucket 2 perplexity 12.34
global step 600 learning rate 0.5000 step-time 2.71 perplexity 4.34
eval: bucket 0 perplexity 2.48
eval: bucket 1 perplexity 2.96
eval: bucket 2 perplexity 4.78
global step 800 learning rate 0.5000 step-time 2.63 perplexity 2.72
eval: bucket 0 perplexity 1.75
eval: bucket 1 perplexity 2.15
eval: bucket 2 perplexity 3.45
global step 1000 learning rate 0.5000 step-time 2.56 perplexity 2.26
eval: bucket 0 perplexity 1.65
eval: bucket 1 perplexity 1.84
eval: bucket 2 perplexity 3.17
global step 1200 learning rate 0.5000 step-time 2.68 perplexity 2.00
eval: bucket 0 perplexity 1.29
eval: bucket 1 perplexity 1.69
eval: bucket 2 perplexity 2.57
global step 1400 learning rate 0.5000 step-time 2.86 perplexity 1.84
eval: bucket 0 perplexity 1.48
eval: bucket 1 perplexity 1.70
eval: bucket 2 perplexity 2.15
global step 1600 learning rate 0.5000 step-time 3.40 perplexity 1.76
eval: bucket 0 perplexity 1.65
eval: bucket 1 perplexity 1.67
eval: bucket 2 perplexity 2.18
global step 1800 learning rate 0.5000 step-time 3.65 perplexity 1.71
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.79
eval: bucket 2 perplexity 2.04
global step 2000 learning rate 0.5000 step-time 2.68 perplexity 1.56
eval: bucket 0 perplexity 1.30
eval: bucket 1 perplexity 1.53
eval: bucket 2 perplexity 1.83
global step 2200 learning rate 0.5000 step-time 3.33 perplexity 1.61
eval: bucket 0 perplexity 1.50
eval: bucket 1 perplexity 1.66
eval: bucket 2 perplexity 1.70
global step 2400 learning rate 0.5000 step-time 3.01 perplexity 1.52
eval: bucket 0 perplexity 1.29
eval: bucket 1 perplexity 1.47
eval: bucket 2 perplexity 1.79
global step 2600 learning rate 0.5000 step-time 3.09 perplexity 1.53
eval: bucket 0 perplexity 1.34
eval: bucket 1 perplexity 1.57
eval: bucket 2 perplexity 1.90
global step 2800 learning rate 0.5000 step-time 2.92 perplexity 1.49
eval: bucket 0 perplexity 1.35
eval: bucket 1 perplexity 1.67
eval: bucket 2 perplexity 1.85
global step 3000 learning rate 0.5000 step-time 2.82 perplexity 1.44
eval: bucket 0 perplexity 1.39
eval: bucket 1 perplexity 1.55
eval: bucket 2 perplexity 1.81
global step 3200 learning rate 0.5000 step-time 2.68 perplexity 1.43
eval: bucket 0 perplexity 1.49
eval: bucket 1 perplexity 1.35
eval: bucket 2 perplexity 1.87
global step 3400 learning rate 0.5000 step-time 2.90 perplexity 1.41
eval: bucket 0 perplexity 1.35
eval: bucket 1 perplexity 1.56
eval: bucket 2 perplexity 1.73
global step 3600 learning rate 0.5000 step-time 2.79 perplexity 1.40
eval: bucket 0 perplexity 1.27
eval: bucket 1 perplexity 1.32
eval: bucket 2 perplexity 1.59
global step 3800 learning rate 0.5000 step-time 2.87 perplexity 1.38
eval: bucket 0 perplexity 1.52
eval: bucket 1 perplexity 1.46
eval: bucket 2 perplexity 1.52
global step 4000 learning rate 0.5000 step-time 2.74 perplexity 1.36
eval: bucket 0 perplexity 1.49
eval: bucket 1 perplexity 1.41
eval: bucket 2 perplexity 1.83
global step 4200 learning rate 0.5000 step-time 2.80 perplexity 1.37
eval: bucket 0 perplexity 1.23
eval: bucket 1 perplexity 1.36
eval: bucket 2 perplexity 1.58
global step 4400 learning rate 0.5000 step-time 2.94 perplexity 1.36
eval: bucket 0 perplexity 1.58
eval: bucket 1 perplexity 1.53
eval: bucket 2 perplexity 1.73
global step 4600 learning rate 0.5000 step-time 3.16 perplexity 1.35
eval: bucket 0 perplexity 1.25
eval: bucket 1 perplexity 1.54
eval: bucket 2 perplexity 1.58
global step 4800 learning rate 0.5000 step-time 2.74 perplexity 1.33
eval: bucket 0 perplexity 1.44
eval: bucket 1 perplexity 1.60
eval: bucket 2 perplexity 1.72
global step 5000 learning rate 0.5000 step-time 2.77 perplexity 1.33
eval: bucket 0 perplexity 1.36
eval: bucket 1 perplexity 1.38
eval: bucket 2 perplexity 1.60
global step 5200 learning rate 0.5000 step-time 2.97 perplexity 1.32
eval: bucket 0 perplexity 1.29
eval: bucket 1 perplexity 1.41
eval: bucket 2 perplexity 1.66
global step 5400 learning rate 0.5000 step-time 2.77 perplexity 1.30
eval: bucket 0 perplexity 1.31
eval: bucket 1 perplexity 1.52
eval: bucket 2 perplexity 1.45
global step 5600 learning rate 0.5000 step-time 2.80 perplexity 1.30
eval: bucket 0 perplexity 1.31
eval: bucket 1 perplexity 1.28
eval: bucket 2 perplexity 1.75
global step 5800 learning rate 0.5000 step-time 2.64 perplexity 1.29
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.33
eval: bucket 2 perplexity 1.41
global step 6000 learning rate 0.5000 step-time 2.76 perplexity 1.28
eval: bucket 0 perplexity 1.26
eval: bucket 1 perplexity 1.39
eval: bucket 2 perplexity 1.48
global step 6200 learning rate 0.5000 step-time 2.55 perplexity 1.28
eval: bucket 0 perplexity 1.37
eval: bucket 1 perplexity 1.37
eval: bucket 2 perplexity 1.67
global step 6400 learning rate 0.5000 step-time 2.68 perplexity 1.26
eval: bucket 0 perplexity 1.23
eval: bucket 1 perplexity 1.50
eval: bucket 2 perplexity 1.44
global step 6600 learning rate 0.5000 step-time 2.98 perplexity 1.26
eval: bucket 0 perplexity 1.12
eval: bucket 1 perplexity 1.54
eval: bucket 2 perplexity 1.47
global step 6800 learning rate 0.5000 step-time 2.87 perplexity 1.26
eval: bucket 0 perplexity 1.22
eval: bucket 1 perplexity 1.29
eval: bucket 2 perplexity 1.56
global step 7000 learning rate 0.5000 step-time 2.81 perplexity 1.26
eval: bucket 0 perplexity 1.22
eval: bucket 1 perplexity 1.45
eval: bucket 2 perplexity 1.54
global step 7200 learning rate 0.5000 step-time 2.76 perplexity 1.25
eval: bucket 0 perplexity 1.35
eval: bucket 1 perplexity 1.46
eval: bucket 2 perplexity 1.40
global step 7400 learning rate 0.5000 step-time 3.06 perplexity 1.24
eval: bucket 0 perplexity 1.18
eval: bucket 1 perplexity 1.26
eval: bucket 2 perplexity 1.48
global step 7600 learning rate 0.5000 step-time 3.15 perplexity 1.25
eval: bucket 0 perplexity 1.47
eval: bucket 1 perplexity 1.31
eval: bucket 2 perplexity 1.50
global step 7800 learning rate 0.5000 step-time 3.13 perplexity 1.24
eval: bucket 0 perplexity 1.50
eval: bucket 1 perplexity 1.43
eval: bucket 2 perplexity 1.46
global step 8000 learning rate 0.5000 step-time 2.76 perplexity 1.23
eval: bucket 0 perplexity 1.39
eval: bucket 1 perplexity 1.37
eval: bucket 2 perplexity 1.47
global step 8200 learning rate 0.5000 step-time 2.64 perplexity 1.22
eval: bucket 0 perplexity 1.30
eval: bucket 1 perplexity 1.25
eval: bucket 2 perplexity 1.59
global step 8400 learning rate 0.5000 step-time 2.38 perplexity 1.23
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.43
eval: bucket 2 perplexity 1.45
global step 8600 learning rate 0.5000 step-time 2.53 perplexity 1.21
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.33
eval: bucket 2 perplexity 1.39
global step 8800 learning rate 0.5000 step-time 2.58 perplexity 1.21
eval: bucket 0 perplexity 1.21
eval: bucket 1 perplexity 1.31
eval: bucket 2 perplexity 1.50
global step 9000 learning rate 0.5000 step-time 2.88 perplexity 1.21
eval: bucket 0 perplexity 1.36
eval: bucket 1 perplexity 1.30
eval: bucket 2 perplexity 1.57
global step 9200 learning rate 0.5000 step-time 3.03 perplexity 1.21
eval: bucket 0 perplexity 1.47
eval: bucket 1 perplexity 1.45
eval: bucket 2 perplexity 1.38
global step 9400 learning rate 0.5000 step-time 2.77 perplexity 1.20
eval: bucket 0 perplexity 1.39
eval: bucket 1 perplexity 1.29
eval: bucket 2 perplexity 1.55
global step 9600 learning rate 0.5000 step-time 2.86 perplexity 1.19
eval: bucket 0 perplexity 1.53
eval: bucket 1 perplexity 1.35
eval: bucket 2 perplexity 1.46
global step 9800 learning rate 0.5000 step-time 2.87 perplexity 1.19
eval: bucket 0 perplexity 1.43
eval: bucket 1 perplexity 1.43
eval: bucket 2 perplexity 1.80
global step 10000 learning rate 0.5000 step-time 2.74 perplexity 1.18
eval: bucket 0 perplexity 1.36
eval: bucket 1 perplexity 1.50
eval: bucket 2 perplexity 1.45
Training process stopped.
Beginning calculation word error rate (WER) on test sample.
WER : 0.469490521327
Accuracy : 0.530509478673
Currently you have to create the model directory manually.
I used the following command to train a G2P model:
python g2p.py --train /home/cmudict.dict --model /home/MyModel --max_steps 8400
Here is the log:
Preparing G2P data
Creating vocabularies in /home/MyModel
Creating vocabulary /home/MyModel/vocab.phoneme
Creating vocabulary /home/MyModel/vocab.grapheme
Reading development and training data.
Creating 2 layers of 64 units.
........
Reading model parameters from /home/MyModel/translate.ckpt-8200
global step 8400 learning rate 0.4901 step-time 3.43 perplexity 1.37
eval: bucket 0 perplexity 1.46
eval: bucket 1 perplexity 1.29
eval: bucket 2 perplexity 1.47
Training process stopped.
Beginning calculation word error rate (WER) on test sample.
WER : 0.4961492891
Accuracy : 0.5038507109
In the MyModel directory there are many generated files, but there is no "model" file.
translate.ckpt-200
translate.ckpt-200.meta
translate.ckpt-400
translate.ckpt-400.meta
translate.ckpt-600
translate.ckpt-600.meta
translate.ckpt-7200
translate.ckpt-7200.meta
translate.ckpt-7400
translate.ckpt-7400.meta
translate.ckpt-7600
translate.ckpt-7600.meta
translate.ckpt-7800
translate.ckpt-7800.meta
translate.ckpt-8000
translate.ckpt-8000.meta
translate.ckpt-8200
translate.ckpt-8200.meta
translate.ckpt-8400
translate.ckpt-8400.meta
model.params
vocab.phoneme
vocab.grapheme
translate.ckpt-8600
translate.ckpt-8600.meta
translate.ckpt-8800
checkpoint
translate.ckpt-8800.meta
Where do I get that "model" file, or do I have to rename translate.ckpt-8800 to "model"?
In g2p.py I added a time.time() call around the command
self.model.saver.restore(self.session, os.path.join(self.model_dir,
"model"))
to see how long it takes to load a pre-trained model to decode words. With a model trained with 512 nodes I get:
Time to load model: 2.53336596489
with only 64 nodes I don't get much savings:
Time to load model: 2.50763916969
which, according to the Python time module, is reported in seconds. That seems really slow. I am using the CPU instead of the GPU because, if we are to include a similar NN model in our software, we won't have any GPU power on our servers. Still, when I compare it with a current OpenFst implementation of an n-gram model, that one takes only 300 ms (0.3 s) to load in C++.
It may be faster if I can restore the saved model from C++, but I have to look into writing code to allow that.
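For reference, a minimal sketch of the timing described above, wrapping the checkpoint restore in time.time(); the attribute names follow the snippet from the issue and may not match the current code:

import os
import time

def timed_restore(model, session, model_dir):
    start = time.time()
    model.saver.restore(session, os.path.join(model_dir, "model"))
    print("Time to load model: %.2f s" % (time.time() - start))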
The file might disappear between the exists check and the open anyway, so the check is redundant. Just open the files and proceed, and raise an error if the open fails.
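A minimal sketch of the suggested change, with an illustrative function name: skip the exists check and just try the open, turning a failure into a clear error:

def open_dictionary(path):
    try:
        return open(path, "r")
    except IOError as exc:
        # The file may have disappeared or never existed; report it once here.
        raise RuntimeError("Cannot open dictionary file %s: %s" % (path, exc))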
As we discussed, what if we bias word pronunciations by word frequency?
So, I was training a new model based on the CMUSphinx dictionary with the --max_steps 10000 --size 512 --num_layers 3 --learning_rate 0.5 options. After the training finished, I got this output from the trained model:
a
M HH HH HH HH UH UH UH UH UH
b
M M UH UH UH UH UH UH UH UH
c
M M M UH UH UH UH UH UH UH
d
M M M M UH UH UH UH UH UH
hello
M HH HH HH HH UW UW UW UW M M M M M M
aa
HH HH HH HH HH HH HH HH UH UH
Is there anything wrong with my approach? This was the last output while training the model:
global step 10000 learning rate 0.4000 step-time 3.48 perplexity 1.15
eval: bucket 0 perplexity 1.40
eval: bucket 1 perplexity 1.25
eval: bucket 2 perplexity 1.34
Never convert an integer to a string only to convert it back to an integer later; this is very inefficient.
Never join list items into a string only to split them and join them again.
Remove the code which is not used.
Is it possible to get phone error rate in addition to word error rate?
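One way to do that, sketched below with an illustrative helper: compute the Levenshtein distance between the reference and predicted phoneme sequences and divide by the reference length (the reference is assumed non-empty):

def phoneme_error_rate(reference, hypothesis):
    # reference and hypothesis are lists of phoneme symbols
    dist = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        dist[i][0] = i
    for j in range(len(hypothesis) + 1):
        dist[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return float(dist[-1][-1]) / len(reference)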
Hi,
is it possible to output grapheme-phoneme alignment data from this model?
many thanks,
Daniel
When you convert letter and phoneme symbols to numerical ids, isn't it confusing for the model to train with integers for classes? Would it be better to have one-hot encoding or maybe even letter embeddings to make distances between letters or phonemes more meaningful?
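For what it's worth, the integer ids are normally not used as distances directly: the TensorFlow seq2seq models typically map each id to a learned embedding vector. A minimal sketch of that lookup (TF 1.x API, illustrative sizes):

import tensorflow as tf

vocab_size, embedding_size = 40, 64
embeddings = tf.Variable(
    tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
letter_ids = tf.constant([5, 12, 3])  # integer ids for three symbols
# Each id becomes a dense vector; distances are learned, not implied by the id.
letter_vectors = tf.nn.embedding_lookup(embeddings, letter_ids)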
Many steps have the same perplexity before training ends. The number of steps could thus be reduced significantly if we stop once we see the same perplexity 4 times or so; it should not affect the accuracy.
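A minimal sketch of that stopping rule, with illustrative names: keep the evaluation perplexities and stop once the last few are essentially unchanged:

def should_stop(perplexity_history, patience=4, tolerance=1e-3):
    # Stop when the last `patience` evaluations differ by less than `tolerance`.
    if len(perplexity_history) < patience:
        return False
    recent = perplexity_history[-patience:]
    return max(recent) - min(recent) < tolerance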
Square brackets like "[model_folder_path]" are reserved for optional arguments; required arguments are usually simply underlined.
There is no need to call print("> ", end="") twice.
Print numbers with 2 digits after the decimal point.
Pronunciation variants of the same word should not be split between the train and test sets.
sam@speechws13:~/g2p-seq2seq-master$ g2p-seq2seq --interactive --model g2p-seq2seq-cmudict/g2p-seq2seq-cmudict/modle
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in <module>
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
return self.resolve()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/__init__.py", line 23, in <module>
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 36, in <module>
ImportError: No module named data_utils
How can I fix this issue?
Thank you.
Move
train_gr, train_ph = data_utils.split_to_grapheme_phoneme(train_dic)
valid_gr, valid_ph = data_utils.split_to_grapheme_phoneme(valid_dic)
test_gr, test_ph = data_utils.split_to_grapheme_phoneme(test_dic)
from the main function to the train function.
Is there any plan to replace online lmtool with g2p-seq2seq or start a new service like that ?
http://www.speech.cs.cmu.edu/tools/lmtool-new.html
[shmyrev@alpha g2p_seq2seq]$ cat > word list
hello
world
how
are
you
[shmyrev@alpha g2p_seq2seq]$ python g2p.py --model /home/shmyrev/cmudict-g2p-model --decode word.list
HH EH L OW
W ER L D
HH AW
AA R
The last word is missing.
Moreover, each line should contain the word itself, not just the phonemes, so that the output is a ready-to-use dictionary (a sketch of this follows the example):
[shmyrev@alpha g2p_seq2seq]$ python g2p.py --model /home/shmyrev/cmudict-g2p-model --decode word.list
hello HH EH L OW
world W ER L D
how HH AW
are AA R
you Y UW
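A minimal sketch of that output format, assuming a decode_word callable that returns the phoneme list for a word (illustrative names, not the project's API):

def decode_wordlist(wordlist_path, decode_word):
    with open(wordlist_path) as inp:
        for line in inp:
            word = line.strip()
            if word:
                # Echo the word together with its pronunciation so the output
                # can be used directly as a dictionary.
                print("%s %s" % (word, " ".join(decode_word(word))))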
To make them read vocabulary file only once.
Dear All,
I got an encoding error in the test phase (the training and interactive phases were fine). My training dictionary is a mixture of cmudict (ASCII) and Chinese (UTF-8) lexicons. What should I do? Should I convert all cmudict entries to UTF-8?
Thanks a lot in advance!
global step 91200 learning rate 0.2425 step-time 0.13 perplexity 1.02
Training done.
Creating 2 layers of 512 units.
Reading model parameters from g2p-seq2seq-oc16
Beginning calculation word error rate (WER) on test sample.
Traceback (most recent call last):
File "/home/liao/anaconda3/envs/python2.7/bin/g2p-seq2seq", line 9, in
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 67, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 234, in train
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 347, in evaluate
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 323, in calc_error
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 279, in decode_word
UnicodeEncodeError: 'ascii' codec can't encode character u'\u86c8' in position 9: ordinal not in range(128)
瘦西湖 sh ou4 x i1 h u2
睃 s uo1
supercuts S UW1 P ER0 K AH2 T S
电报机 d ian4 b ao4 j i1
galka G AE1 L K AH0
知 zh ix4
Unipus Y UW1 N IH0 P AH0 S
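A minimal sketch, assuming Python 2, of loading such a mixed dictionary as unicode throughout; the code that prints these entries may additionally need an explicit .encode("utf-8") instead of relying on the default ascii codec (illustrative helper, not the project's loader):

import codecs

def load_dictionary(path):
    entries = []
    with codecs.open(path, "r", encoding="utf-8") as inp:
        for line in inp:
            parts = line.strip().split()
            if len(parts) > 1:
                entries.append((parts[0], parts[1:]))  # (word, phonemes)
    return entries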
Exceptions should be raised in only one place; right now the check for extra symbols is done in several places.
Fix LICENSE.txt
To make them dictionary-independent
Test dictionary is simply ignored
whether we can get alternate pronunciations
If you only need the direct and reversed dictionaries, it is better to change this method:
def initialize_vocabulary(vocabulary_path):
  """Initialize vocabulary from file.

  We assume the vocabulary is stored one-item-per-line, so a file:
    d
    c
  will result in a vocabulary {"d": 0, "c": 1}, and this function will
  also return the reversed-vocabulary ["d", "c"].
To this method with optional reverse parameter:
def load_vocab(vocabulary_path, reverse = False)
This method should return only one vocabulary, direct or reversed, based on the optional flag.
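A minimal sketch of that load_vocab signature, returning either mapping based on the flag:

def load_vocab(vocabulary_path, reverse=False):
    with open(vocabulary_path) as inp:
        symbols = [line.strip() for line in inp]
    if reverse:
        return symbols  # id -> symbol
    return dict((sym, idx) for idx, sym in enumerate(symbols))  # symbol -> id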
Traceback (most recent call last):
File "g2p.py", line 442, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/default/_app.py", line 30, in run
sys.exit(main(sys.argv))
File "g2p.py", line 425, in main
g2p_model.train(g2p_params, FLAGS.train, FLAGS.valid, FLAGS.test)
File "g2p.py", line 243, in train
self.__run_evals()
File "g2p.py", line 269, in __run_evals
self.valid_set, bucket_id)
File "/usr/lib/python2.7/site-packages/tensorflow/models/rnn/translate/seq2seq_model.py", line 252, in get_batch
encoder_input, decoder_input = random.choice(data[bucket_id])
File "/usr/lib64/python2.7/random.py", line 274, in choice
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
Here you can do it in a single pass, and there is no need for the intermediate list (a sketch follows the snippet):
lst = []
for line in inp_dictionary:
    lst.append(line.strip().split())
graphemes, phonemes = [], []
for line in lst:
    if len(line) > 1:
        graphemes.append(list(line[0]))
        phonemes.append(line[1:])
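A minimal sketch of the single-pass version, splitting each line once and appending directly (inp_dictionary as in the snippet above):

graphemes, phonemes = [], []
for line in inp_dictionary:
    items = line.strip().split()
    if len(items) > 1:
        graphemes.append(list(items[0]))
        phonemes.append(items[1:])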
SGD seems to converge slowly. Can we have an option for RMSProp, Adadelta, and Adagrad? This should be easy to implement with the respective TensorFlow optimizers.
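A minimal sketch, not the project's code, of selecting one of those optimizers by name using the tf.train classes (TF 1.x API):

import tensorflow as tf

def make_optimizer(name, learning_rate):
    optimizers = {
        "sgd": tf.train.GradientDescentOptimizer,
        "rmsprop": tf.train.RMSPropOptimizer,
        "adadelta": tf.train.AdadeltaOptimizer,
        "adagrad": tf.train.AdagradOptimizer,
    }
    return optimizers[name](learning_rate)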
Also update the error rates in the README. The Phonetisaurus error rate on this set is also 24.4%; Phonetisaurus on the latest cmudict is 33.89%. Provide our results on the latest cmudict.
Does it support the Arabic language? If not, what alternative method could be used for text-to-phoneme conversion of Arabic text?
Data which is not important must be removed from git and added to the .gitignore file, for example train/model.
Create a toy dictionary with 20 lines and train a g2p model with 2 units in the hidden layer.