
Zamia Speech

Python scripts to compute audio and language models from voxforge.org speech data and many other sources. Models that can be built include:

  • Kaldi nnet3 chain audio models
  • KenLM language models in ARPA format
  • sequitur g2p models
  • wav2letter++ models

Important: Please note that these scripts form in no way a complete application ready for end-user consumption. However, if you are a developer interested in natural language processing you may find some of them useful. Contributions, patches and pull requests are very welcome.

At the time of this writing, the scripts here are focused on building the English and German VoxForge models. However, there is no reason why they couldn't be used to build other language models as well, feel free to contribute support for those.


Download

We have various models plus source code and binaries for the tools used to build these models available for download. Everything is free and open source.

All our model and data downloads can be found here: Downloads

ASR Models

Our pre-built ASR models can be downloaded here: ASR Models

  • Kaldi ASR, English:
    • kaldi-generic-en-tdnn_f Large nnet3-chain factorized TDNN model, trained on ~1200 hours of audio. Has decent background noise resistance and can also be used on phone recordings. Should provide the best accuracy but is a bit more resource intensive than the other models.
    • kaldi-generic-en-tdnn_sp Large nnet3-chain model, trained on ~1200 hours of audio. Has decent background noise resistance and can also be used on phone recordings. Less accurate but also slightly less resource intensive than the tdnn_f model.
    • kaldi-generic-en-tdnn_250 Same as the larger models but less resource intensive, suitable for use in embedded applications (e.g. a RaspberryPi 3).
    • kaldi-generic-en-tri2b_chain GMM Model, trained on the same data as the above models - meant for auto segmentation tasks.
  • Kaldi ASR, German:
    • kaldi-generic-de-tdnn_f Large nnet3-chain model, trained on ~400 hours of audio. Has decent background noise resistance and can also be used on phone recordings.
    • kaldi-generic-de-tdnn_250 Same as the large model but less resource intensive, suitable for use in embedded applications (e.g. a RaspberryPi 3).
    • kaldi-generic-de-tri2b_chain GMM Model, trained on the same data as the above two models - meant for auto segmentation tasks.
  • wav2letter++, German:
    • w2l-generic-de Large model, trained on ~400 hours of audio. Has decent background noise resistance and can also be used on phone recordings.

NOTE: It is important to realize that these models can and should be adapted to your application domain. See Model Adaptation for details.

IPA Dictionaries (Lexicons)

Our dictionaries can be downloaded here: Dictionaries

  • IPA UTF-8, English:
    • dict-en.ipa Based on CMUDict with many additional entries generated via Sequitur G2P.
  • IPA UTF-8, German:
    • dict-de.ipa Created manually from scratch with many additional auto-reviewed entries extracted from Wiktionary.

G2P Models

Our pre-built G2P models can be downloaded here: G2P Models

  • Sequitur, English:
    • sequitur-dict-en.ipa Sequitur G2P model trained on our English IPA dictionary (UTF8).
  • Sequitur, German:
    • sequitur-dict-de.ipa Sequitur G2P model trained on our German IPA dictionary (UTF8).

Language Models

Our pre-built ARPA language models can be downloaded here: Language Models

  • KenLM, order 4, English, ARPA:
    • generic_en_lang_model_small
  • KenLM, order 6, English, ARPA:
    • generic_en_lang_model_large
  • KenLM, order 4, German, ARPA:
    • generic_de_lang_model_small
  • KenLM, order 6, German, ARPA:
    • generic_de_lang_model_large

Code

Get Started with our Pre-Trained Models

Run Example Applications

Wave File Decoding Demo

Download a few sample wave files

$ wget http://goofy.zamia.org/zamia-speech/misc/demo_wavs.tgz
--2018-06-23 16:46:28--  http://goofy.zamia.org/zamia-speech/misc/demo_wavs.tgz
Resolving goofy.zamia.org (goofy.zamia.org)... 78.47.65.20
Connecting to goofy.zamia.org (goofy.zamia.org)|78.47.65.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 619852 (605K) [application/x-gzip]
Saving to: ‘demo_wavs.tgz’

demo_wavs.tgz                     100%[==========================================================>] 605.32K  2.01MB/s    in 0.3s    

2018-06-23 16:46:28 (2.01 MB/s) - ‘demo_wavs.tgz’ saved [619852/619852]

unpack them:

$ tar xfvz demo_wavs.tgz
demo1.wav
demo2.wav
demo3.wav
demo4.wav

download the demo program

$ wget http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_wav.py
--2018-06-23 16:47:53--  http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_wav.py
Resolving goofy.zamia.org (goofy.zamia.org)... 78.47.65.20
Connecting to goofy.zamia.org (goofy.zamia.org)|78.47.65.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2469 (2.4K) [text/plain]
Saving to: ‘kaldi_decode_wav.py’

kaldi_decode_wav.py               100%[==========================================================>]   2.41K  --.-KB/s    in 0s      

2018-06-23 16:47:53 (311 MB/s) - ‘kaldi_decode_wav.py’ saved [2469/2469]

now run kaldi automatic speech recognition on the demo wav files:

$ python kaldi_decode_wav.py -v demo?.wav
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp loading model...
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp loading model... done, took 1.473226s.
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp creating decoder...
DEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp creating decoder... done, took 0.143928s.
DEBUG:root:demo1.wav decoding took     0.37s, likelyhood: 1.863645
i cannot follow you she said 
DEBUG:root:demo2.wav decoding took     0.54s, likelyhood: 1.572326
i should like to engage just for one whole life in that 
DEBUG:root:demo3.wav decoding took     0.42s, likelyhood: 1.709773
philip knew that she was not an indian 
DEBUG:root:demo4.wav decoding took     1.06s, likelyhood: 1.715135
he also contented that better confidence was established by carrying no weapons 
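
The demo script is a thin wrapper around the py-kaldi-asr API. A minimal sketch of the same decoding loop (the model path is an assumption, adjust it to wherever you installed or unpacked the model):

from kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder

# load the nnet3 chain model once (slow), then create a decoder (cheap, one per file/stream)
MODELDIR = '/opt/kaldi/model/kaldi-generic-en-tdnn_sp'
model    = KaldiNNet3OnlineModel(MODELDIR)
decoder  = KaldiNNet3OnlineDecoder(model)

for wavfn in ['demo1.wav', 'demo2.wav', 'demo3.wav', 'demo4.wav']:
    if decoder.decode_wav_file(wavfn):
        hyp, likelihood = decoder.get_decoded_string()
        print('%s: %s' % (wavfn, hyp))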

Live Mic Demo

Determine the name of your pulseaudio mic source:

$ pactl list sources
Source #0
    State: SUSPENDED
    Name: alsa_input.usb-C-Media_Electronics_Inc._USB_PnP_Sound_Device-00.analog-mono
    Description: CM108 Audio Controller Analog Mono
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

download and run demo:

$ wget 'http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_live.py'

$ python kaldi_decode_live.py -s 'CM108'
Kaldi live demo V0.2
Loading model from /opt/kaldi/model/kaldi-generic-en-tdnn_250 ...
Please speak.
hallo computer                      
switch on the radio please                      
please switch on the light                      
what about the weather in stuttgart                     
how are you                      
thank you                      
good bye 
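
Under the hood, kaldi_decode_live.py combines the PulseAudio recorder and voice activity detection from py-nltools with the py-kaldi-asr decoder. A rough sketch of that loop follows; the constructor arguments shown here are assumptions, so check the demo script itself for the exact options it uses:

# rough live decoding sketch based on py-nltools + py-kaldi-asr
# NOTE: the PulseRecorder/VAD/ASR constructor arguments are assumptions -
#       see kaldi_decode_live.py for the options actually used by the demo
from nltools.pulserecorder import PulseRecorder
from nltools.vad import VAD
from nltools.asr import ASR

MODELDIR    = '/opt/kaldi/model/kaldi-generic-en-tdnn_250'
SAMPLE_RATE = 16000

rec = PulseRecorder(source_name='CM108', rate=SAMPLE_RATE)   # pulseaudio mic source
vad = VAD(sample_rate=SAMPLE_RATE)                            # voice activity detection
asr = ASR(model_dir=MODELDIR)                                 # kaldi nnet3 decoder wrapper

rec.start_recording()
print("Please speak.")

while True:
    samples = rec.get_samples()                    # raw audio frames from the mic
    audio, finalize = vad.process_audio(samples)   # returns audio once speech is detected
    if not audio:
        continue
    user_utt, confidence = asr.decode(audio, finalize)
    if finalize:
        print(user_utt)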

Get Started with a Demo STT Service Packaged in Docker

To start the STT service on your local machine, execute:

$ docker pull quay.io/mpuels/docker-py-kaldi-asr-and-model:kaldi-generic-en-tdnn_sp-r20180611
$ docker run --rm -p 127.0.0.1:8080:80/tcp quay.io/mpuels/docker-py-kaldi-asr-and-model:kaldi-generic-en-tdnn_sp-r20180611

To transfer an audio file for transcription to the service, in a second terminal, execute:

$ git clone https://github.com/mpuels/docker-py-kaldi-asr-and-model.git
$ cd docker-py-kaldi-asr-and-model
$ conda env create -f environment.yml
$ source activate py-kaldi-asr-client
$ ./asr_client.py asr.wav
INFO:root: 0.005s:  4000 frames ( 0.250s) decoded, status=200.
...
INFO:root:19.146s: 152000 frames ( 9.500s) decoded, status=200.
INFO:root:27.136s: 153003 frames ( 9.563s) decoded, status=200.
INFO:root:*****************************************************************
INFO:root:** wavfn         : asr.wav
INFO:root:** hstr          : speech recognition system requires training where individuals to exercise political system
INFO:root:** confidence    : -0.578844
INFO:root:** decoding time :    27.14s
INFO:root:*****************************************************************

The Docker image in the example above is the result of stacking 4 images on top of each other.

Requirements

Note: probably incomplete.

  • Python 2.7 with nltk, numpy, ...
  • KenLM
  • kaldi
  • wav2letter++
  • py-nltools
  • sox
  • ffmpeg

Dependencies installation example for Debian:

apt-get install build-essential pkg-config python-pip python-dev python-setuptools python-wheel ffmpeg sox libatlas-base-dev

# Create a symbolic link because one of the pip packages expects ATLAS in this location:
ln -s /usr/include/x86_64-linux-gnu/atlas /usr/include/atlas

pip install numpy nltk cython
pip install py-kaldi-asr py-nltools

Setup Notes

Just some rough notes on the environment needed to get these scripts to run. This is in no way a complete set of instructions, just some hints to get you started.

~/.speechrc

[speech]
vf_login              = <your voxforge login>

speech_arc            = /home/bofh/projects/ai/data/speech/arc
speech_corpora        = /home/bofh/projects/ai/data/speech/corpora

kaldi_root            = /apps/kaldi-cuda

; facebook's wav2letter++
w2l_env_activate      = /home/bofh/projects/ai/w2l/bin/activate
w2l_train             = /home/bofh/projects/ai/w2l/src/wav2letter/build/Train
w2l_decoder           = /home/bofh/projects/ai/w2l/src/wav2letter/build/Decoder

wav16                 = /home/bofh/projects/ai/data/speech/16kHz
noise_dir             = /home/bofh/projects/ai/data/speech/corpora/noise

europarl_de           = /home/bofh/projects/ai/data/corpora/de/europarl-v7.de-en.de
parole_de             = /home/bofh/projects/ai/data/corpora/de/German Parole Corpus/DE_Parole/

europarl_en           = /home/bofh/projects/ai/data/corpora/en/europarl-v7.de-en.en
cornell_movie_dialogs = /home/bofh/projects/ai/data/corpora/en/cornell_movie_dialogs_corpus
web_questions         = /home/bofh/projects/ai/data/corpora/en/WebQuestions
yahoo_answers         = /home/bofh/projects/ai/data/corpora/en/YahooAnswers

europarl_fr           = /home/bofh/projects/ai/data/corpora/fr/europarl-v7.fr-en.fr
est_republicain       = /home/bofh/projects/ai/data/corpora/fr/est_republicain.txt

wiktionary_de         = /home/bofh/projects/ai/data/corpora/de/dewiktionary-20180320-pages-meta-current.xml

[tts]
host                  = localhost
port                  = 8300
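
The file uses standard INI syntax, so any setting can be read with Python's ConfigParser; a minimal sketch (the scripts themselves go through their own config helper, this is only meant to illustrate the format):

# minimal sketch: read settings from ~/.speechrc (standard INI syntax)
import os

try:
    from configparser import ConfigParser   # Python 3
except ImportError:
    from ConfigParser import ConfigParser   # Python 2.7

cfg = ConfigParser()
cfg.read(os.path.expanduser('~/.speechrc'))

kaldi_root     = cfg.get('speech', 'kaldi_root')
speech_corpora = cfg.get('speech', 'speech_corpora')
print('kaldi root: %s, corpora dir: %s' % (kaldi_root, speech_corpora))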

tmp directory

Some scripts expect a local tmp directory to be present in the same directory where all the scripts live, i.e.

mkdir tmp

Speech Corpora

The following list contains speech corpora supported by this script collection.

  • Forschergeist (German, 2 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/forschergeist
    • unpack them into the directory <~/.speechrc:speech_corpora>/forschergeist
  • German Speechdata Package Version 2 (German, 148 hours):

    • Unpack the archive such that the directories dev, test, and train are direct subdirectories of <~/.speechrc:speech_arc>/gspv2.
    • Then run the script ./import_gspv2.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/gspv2.
  • Noise:

    • Download the tarball
    • unpack it into the directory <~/.speechrc:speech_corpora>/ (it will generate a noise subdirectory there)
  • LibriSpeech ASR (English, 475 hours):

    • Download the set of 360 hours "clean" speech tarball
    • Unpack the archive such that the directory LibriSpeech is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_librispeech.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/librispeech.
  • The LJ Speech Dataset (English, 24 hours):

    • Download the tarball
    • Unpack the archive such that the directory LJSpeech-1.1 is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script import_ljspeech.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/lindajohnson-11.
  • Mozilla Common Voice German (German, 140 hours):

    • Download de.tar.gz
    • Unpack the archive such that the directory cv_de is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_mozde.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/cv_de.
  • Mozilla Common Voice V1 (English, 252 hours):

    • Download cv_corpus_v1.tar.gz
    • Unpack the archive such that the directory cv_corpus_v1 is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_mozcv1.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/cv_corpus_v1.
  • Munich Artificial Intelligence Laboratories GmbH (M-AILABS) Speech Dataset (English, 147 hours, German, 237 hours, French, 190 hours):

    • Download de_DE.tgz, en_UK.tgz, en_US.tgz, fr_FR.tgz (Mirror)
    • Create a subdirectory m_ailabs in <~/.speechrc:speech_arc>
    • Unpack the downloaded tarballs inside the m_ailabs subdirectory
    • For French, create a by_book directory and move the male and female directories into it, as the French archive does not follow exactly the same structure as the English and German ones
    • Then run the script ./import_mailabs.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/m_ailabs_en, <~/.speechrc:speech_corpora>/m_ailabs_de and <~/.speechrc:speech_corpora>/m_ailabs_fr.
  • TED-LIUM Release 3 (English, 210 hours):

    • Download TEDLIUM_release-3.tgz
    • Unpack the archive such that the directory TEDLIUM_release-3 is a direct subdirectory of <~/.speechrc:speech_arc>.
    • Then run the script ./import_tedlium3.py to convert the corpus to the VoxForge format. The resulting corpus will be written to <~/.speechrc:speech_corpora>/tedlium3.
  • VoxForge (English, 75 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/voxforge_en
    • unpack them into the directory <~/.speechrc:speech_corpora>/voxforge_en
  • VoxForge (German, 56 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/voxforge_de
    • unpack them into the directory <~/.speechrc:speech_corpora>/voxforge_de
  • VoxForge (French, 140 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/voxforge_fr
    • unpack them into the directory <~/.speechrc:speech_corpora>/voxforge_fr
  • Zamia (English, 5 minutes):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/zamia_en
    • unpack them into the directory <~/.speechrc:speech_corpora>/zamia_en
  • Zamia (German, 18 hours):

    • Download all .tgz files into the directory <~/.speechrc:speech_arc>/zamia_de
    • unpack them into the directory <~/.speechrc:speech_corpora>/zamia_de

Technical note: For most corpora we have corrected transcripts in our databases, which can be found in data/src/speech/<corpus_name>/transcripts_*.csv. As these have been created through many hours of (semi-)manual review, they should be of higher quality than the original prompts, so they are used during training of our ASR models.

Once you have downloaded and, if necessary, converted a corpus you need to run

./speech_audio_scan.py <corpus name>

on it. This will add missing prompts to the CSV databases and convert audio files to 16kHz mono WAVE format.

Adding Artificial Noise or Other Effects

To improve noise resistance it is possible to derive corpora from existing ones with noise added:

./speech_gen_noisy.py zamia_de
./speech_audio_scan.py zamia_de_noisy
cp data/src/speech/zamia_de/spk2gender data/src/speech/zamia_de_noisy/
cp data/src/speech/zamia_de/spk_test.txt data/src/speech/zamia_de_noisy/
./auto_review.py -a zamia_de_noisy
./apply_review.py -l de zamia_de_noisy review-result.csv 

The speech_gen_phone.py script runs recordings through typical telephone codecs. Such a corpus can be used to train models that support 8kHz phone recordings:

./speech_gen_phone.py zamia_de
./speech_audio_scan.py zamia_de_phone
cp data/src/speech/zamia_de/spk2gender data/src/speech/zamia_de_phone/
cp data/src/speech/zamia_de/spk_test.txt data/src/speech/zamia_de_phone/
./auto_review.py -a zamia_de_phone
./apply_review.py -l de zamia_de_phone review-result.csv 
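
The general idea behind the phone variant is to downsample the audio, pass it through a narrowband codec and resample it back to 16kHz. A conceptual illustration using sox (this is not the actual speech_gen_phone.py implementation, just a sketch of the effect on a single file):

# conceptual sketch of phone-style augmentation (NOT the actual speech_gen_phone.py code):
# downsample to 8kHz, pass through the GSM codec, then resample back to 16kHz mono
import subprocess

def phonify(in_wav, out_wav, tmp_gsm='tmp/phone.gsm'):
    # 16kHz WAV -> 8kHz GSM (lossy narrowband telephone codec)
    subprocess.check_call(['sox', in_wav, '-r', '8000', '-c', '1', tmp_gsm])
    # 8kHz GSM -> 16kHz mono 16-bit WAV, ready for the usual training pipeline
    subprocess.check_call(['sox', tmp_gsm, '-r', '16000', '-c', '1', '-b', '16', out_wav])

phonify('demo1.wav', 'demo1_phone.wav')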

Text Corpora

The following list contains text corpora that can be used to train language models with the scripts contained in this repository:

  • Europarl, specifically parallel corpus German-English and parallel corpus French-English:

    • corresponding variable in .speechrc: europarl_de, europarl_en, europarl_fr
    • sentences extraction: run ./speech_sentences.py europarl_de, ./speech_sentences.py europarl_en and ./speech_sentences.py europarl_fr
  • Cornell Movie--Dialogs Corpus:

    • corresponding variable in .speechrc: cornell_movie_dialogs
    • sentences extraction: run ./speech_sentences.py cornell_movie_dialogs
  • German Parole Corpus:

    • corresponding variable in .speechrc: parole_de
    • sentences extraction: train punkt tokenizer using ./speech_train_punkt_tokenizer.py, then run ./speech_sentences.py parole_de
  • WebQuestions: web_questions

    • corresponding variable in .speechrc: web_questions
    • sentences extraction: run ./speech_sentences.py web_questions
  • Yahoo! Answers dataset: yahoo_answers

    • corresponding variable in .speechrc: yahoo_answers
    • sentences extraction: run ./speech_sentences.py yahoo_answers
  • CNRTL Est Républicain Corpus, large corpus of news articles (4.3M headlines/paragraphs) available under a CC BY-NC-SA license. Download XML files and extract headlines and paragraphs to a text file with the following command: xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt

    • corresponding variable in .speechrc: est_republicain
    • sentences extraction: run ./speech_sentences.py est_republicain

Sentences can also be extracted from our speech corpora. To do that, run:

  • English Speech Corpora

    • ./speech_sentences.py voxforge_en
    • ./speech_sentences.py librispeech
    • ./speech_sentences.py zamia_en
    • ./speech_sentences.py cv_corpus_v1
    • ./speech_sentences.py ljspeech
    • ./speech_sentences.py m_ailabs_en
    • ./speech_sentences.py tedlium3
  • German Speech Corpora

    • ./speech_sentences.py forschergeist
    • ./speech_sentences.py gspv2
    • ./speech_sentences.py voxforge_de
    • ./speech_sentences.py zamia_de
    • ./speech_sentences.py m_ailabs_de
    • ./speech_sentences.py cv_de

Language Model

English

Prerequisites:

  • text corpora europarl_en, cornell_movie_dialogs, web_questions, and yahoo_answers are installed, sentences extracted (see instructions above).
  • sentences are extracted from speech corpora librispeech, voxforge_en, zamia_en, cv_corpus_v1, ljspeech, m_ailabs_en, tedlium3

To train a small, pruned English language model of order 4 using KenLM for use in both kaldi and wav2letter builds run:

./speech_build_lm.py generic_en_lang_model_small europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3

to train a larger model of order 6 with less pruning:

./speech_build_lm.py -o 6 -p "0 0 0 0 1" generic_en_lang_model_large europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3

to train a medium size model of order 5:

./speech_build_lm.py -o 5 -p "0 0 1 2" generic_en_lang_model_medium europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3

German

Prerequisites:

  • text corpora europarl_de and parole_de are installed, sentences extracted (see instructions above).
  • sentences are extracted from speech corpora forschergeist, gspv2, voxforge_de, zamia_de, m_ailabs_de, cv_de

To train a small, pruned German language model of order 4 using KenLM for use in both kaldi and wav2letter builds run:

./speech_build_lm.py generic_de_lang_model_small europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de

to train a larger model of order 6 with less pruning:

./speech_build_lm.py -o 6 -p "0 0 0 0 1" generic_de_lang_model_large europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de

to train a medium size model of order 5:

./speech_build_lm.py -o 5 -p "0 0 1 2" generic_de_lang_model_medium europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de

French

Prerequisites:

  • text corpora europarl_fr and est_republicain are installed, sentences extracted (see instructions above).
  • sentences are extracted from speech corpora voxforge_fr and m_ailabs_fr

To train a French language model using KenLM run:

./speech_build_lm.py generic_fr_lang_model europarl_fr est_republicain voxforge_fr m_ailabs_fr

Submission Review and Transcription

The main tool used for submission review, transcription and lexicon expansion is:

./speech_editor.py

Lexica/Dictionaries

NOTE: We use the terms lexicon and dictionary interchangeably in this documentation and our scripts.

Currently, we have two lexica, one for English and one for German (in data/src/dicts):

  • dict-en.ipa

    • based on CMUDict, with many additional entries generated via Sequitur G2P

  • dict-de.ipa

    • started manually from scratch
    • once enough entries existed to train a reasonable Sequitur G2P model, many entries were converted from the German Wiktionary (see below)

The native format of our lexica is UTF-8 IPA with semicolons as separators. This format is then converted to whatever format is used by the target ASR engine by the corresponding export scripts.
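
For illustration, here is a minimal sketch of reading such a dictionary; the assumption is that each line looks like <word>;<IPA transcription>, see the files in data/src/dicts for the real layout:

# minimal sketch: load a semicolon-separated UTF-8 IPA dictionary into a python dict
# (assumption: each line is "<word>;<ipa transcription>[;...]")
import codecs

lex = {}
with codecs.open('data/src/dicts/dict-en.ipa', 'r', 'utf8') as dictf:
    for line in dictf:
        parts = line.strip().split(';')
        if len(parts) < 2:
            continue
        word, ipa = parts[0], parts[1]
        lex[word] = ipa

print('%d entries loaded' % len(lex))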

Sequitur G2P

Many lexicon-related tools rely on Sequitur G2P to compute pronunciations for words missing from the dictionary. The necessary models can be downloaded from our file server: http://goofy.zamia.org/zamia-speech/g2p/ . For installation, download and unpack them and then put links to them under data/models like so:

data/models/sequitur-dict-de.ipa-latest -> <your model dir>/sequitur-dict-de.ipa-r20180510
data/models/sequitur-dict-en.ipa-latest -> <your model dir>/sequitur-dict-en.ipa-r20180510
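
With a model linked under data/models, a pronunciation for a missing word can also be generated programmatically via the sequitur_gen_ipa() helper from py-nltools (a small sketch; the import path and argument order are assumptions):

# small sketch: generate an IPA pronunciation for an out-of-vocabulary word with Sequitur G2P
# (assumed import path nltools.sequiturclient and argument order (model, word))
from nltools.sequiturclient import sequitur_gen_ipa

MODELFN = 'data/models/sequitur-dict-en.ipa-latest'

word = 'zamia'
ipa  = sequitur_gen_ipa(MODELFN, word)
print(u'%s;%s' % (word, ipa))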

To train your own Sequitur G2P models, use the export and train scripts provided, e.g.:

[guenter@dagobert speech]$ ./speech_sequitur_export.py -d dict-de.ipa
INFO:root:loading lexicon...
INFO:root:loading lexicon...done.
INFO:root:sequitur workdir data/dst/dict-models/dict-de.ipa/sequitur done.
[guenter@dagobert speech]$ ./speech_sequitur_train.sh dict-de.ipa
training sample: 322760 + 16988 devel
iteration: 0
...

Manual Editing

./speech_lex_edit.py word [word2 ...]

is the main curses based, interactive lexicon editor. It will automatically produce candidate entries for new words using Sequitur G2P, MaryTTS and eSpeakNG. The user can then edit these entries manually if necessary and check them by listening to them being synthesized via MaryTTS in different voices.

The lexicon editor is also integrated into various other tools, speech_editor.py in particular, which allows you to transcribe, review and add missing words for new audio samples within one tool - this is the recommended workflow.

I also tend to review lexicon entries randomly from time to time. For that I have a small script which will pick 20 random entries where Sequitur G2P disagrees with the current transcription in the lexicon:

./speech_lex_edit.py `./speech_lex_review.py`

Also, I sometimes use this command to add missing words from transcripts in batch mode:

./speech_lex_edit.py `./speech_lex_missing.py`

Wiktionary

For the German lexicon, entries can be extracted from the German Wiktionary using a set of scripts. To do that, the first step is to extract a set of candidate entries from a Wiktionary XML dump:

./wiktionary_extract_ipa.py 

this will output extracted entries to data/dst/speech/de/dict_wiktionary_de.txt. We now need to train a Sequitur G2P model that translates these entries into our own IPA style and phoneme set:

./wiktionary_sequitur_export.py
./wiktionary_sequitur_train.sh

finally, we translate the entries and check them against the predictions from our regular Sequitur G2P model:

./wiktionary_sequitur_gen.py

this script produces two output files: data/dst/speech/de/dict_wiktionary_gen.txt contains acceptable entries, data/dst/speech/de/dict_wiktionary_rej.txt contains rejected entries.

Kaldi Models (recommended)

English NNet3 Chain Models

The following recipe trains Kaldi models for English.

Before running it, make sure all prerequisites are met (see above for instructions on these):

  • language model generic_en_lang_model_small built
  • some or all speech corpora of voxforge_en, librispeech, cv_corpus_v1, ljspeech, m_ailabs_en, tedlium3 and zamia_en are installed, converted and scanned.
  • optionally noise augmented corpora: voxforge_en_noisy, voxforge_en_phone, librispeech_en_noisy, librispeech_en_phone, cv_corpus_v1_noisy, cv_corpus_v1_phone, zamia_en_noisy and zamia_en_phone

./speech_kaldi_export.py generic-en-small dict-en.ipa generic_en_lang_model_small voxforge_en librispeech zamia_en 
cd data/dst/asr-models/kaldi/generic-en-small
./run-chain.sh

export run with noise augmented corpora included:

./speech_kaldi_export.py generic-en dict-en.ipa generic_en_lang_model_small voxforge_en cv_corpus_v1 librispeech ljspeech m_ailabs_en tedlium3 zamia_en voxforge_en_noisy librispeech_noisy cv_corpus_v1_noisy cv_corpus_v1_phone zamia_en_noisy voxforge_en_phone librispeech_phone zamia_en_phone

German NNet3 Chain Models

The following recipe trains Kaldi models for German.

Before running it, make sure all prerequisites are met (see above for instructions on these):

  • language model generic_de_lang_model_small built
  • some or all speech corpora of voxforge_de, gspv2, forschergeist, zamia_de, m_ailabs_de, cv_de are installed, converted and scanned.
  • optionally noise augmented corpora: voxforge_de_noisy, voxforge_de_phone, zamia_de_noisy and zamia_de_phone

./speech_kaldi_export.py generic-de-small dict-de.ipa generic_de_lang_model_small voxforge_de gspv2 [ forschergeist zamia_de ...]
cd data/dst/asr-models/kaldi/generic-de-small
./run-chain.sh

export run with noise augmented corpora included:

./speech_kaldi_export.py generic-de dict-de.ipa generic_de_lang_model_small voxforge_de gspv2 forschergeist zamia_de voxforge_de_noisy voxforge_de_phone zamia_de_noisy zamia_de_phone m_ailabs_de cv_de

Model Adaptation

For a standalone kaldi model adaptation tool that does not require a complete zamia-speech setup, see

kaldi-adapt-lm

Existing kaldi models (such as the ones we provide for download but also those you may train from scratch using our scripts) can be adapted to (typically domain specific) language models, JSGF grammars and grammar FSTs.

Here is an example how to adapt our English model to a simple command and control JSGF grammar. Please note that this is just a toy example - for real world usage you will probably want to add garbage phoneme loops to the grammar or produce a language model that has some noise resistance built in right away.

Here is the grammar we will use:

#JSGF V1.0;

grammar org.zamia.control;

public <control> = <wake> | <politeCommand> ;

<wake> = ( good morning | hello | ok | activate ) computer;

<politeCommand> = [ please | kindly | could you ] <command> [ please | thanks | thank you ];

<command> = <onOffCommand> | <muteCommand> | <volumeCommand> | <weatherCommand>;

<onOffCommand> = [ turn | switch ] [the] ( light | fan | music | radio ) (on | off) ;

<volumeCommand> = turn ( up | down ) the ( volume | music | radio ) ;

<muteCommand> = mute the ( music | radio ) ;

<weatherCommand> = (what's | what) is the ( temperature | weather ) ;

the next step is to set up a kaldi model adaptation experiment using this script:

./speech_kaldi_adapt.py data/models/kaldi-generic-en-tdnn_250-latest dict-en.ipa control.jsgf control-en

here, data/models/kaldi-generic-en-tdnn_250-latest is the model to be adapted, dict-en.ipa is the dictionary which will be used by the new model, control.jsgf is the JSGF grammar we want the model to be adapted to (you could specify an FST source file or a language model instead here) and control-en is the name of the new model that will be created.

To run the actual adaptation, change into the model directory and run the adaptation script there:

cd data/dst/asr-models/kaldi/control-en
./run-adaptation.sh 

finally, you can create a tarball from the newly created model:

cd ../../../../..
./speech_dist.sh control-en kaldi adapt

wav2letter++ models

English Wav2letter Models

./wav2letter_export.py -l en -v generic-en dict-en.ipa generic_en_lang_model_large voxforge_en cv_corpus_v1 librispeech ljspeech m_ailabs_en tedlium3 zamia_en voxforge_en_noisy librispeech_noisy cv_corpus_v1_noisy cv_corpus_v1_phone zamia_en_noisy voxforge_en_phone librispeech_phone zamia_en_phone
pushd data/dst/asr-models/wav2letter/generic-en/
bash run_train.sh

German Wav2letter Models

./wav2letter_export.py -l de -v generic-de dict-de.ipa generic_de_lang_model_large voxforge_de gspv2 forschergeist zamia_de voxforge_de_noisy voxforge_de_phone zamia_de_noisy zamia_de_phone m_ailabs_de cv_de
pushd data/dst/asr-models/wav2letter/generic-de/
bash run_train.sh

auto-reviews using wav2letter

create auto-review case:

./wav2letter_auto_review.py -l de w2l-generic-de-latest gspv2

run it:

pushd tmp/w2letter_auto_review
bash run_auto_review.sh
popd

apply the results:

./wav2letter_apply_review.py

Audiobook Segmentation and Transcription (Manual)

Some notes on how to segment and transcribe audiobooks or other audio sources (e.g. from librivox) using the abook scripts provided:

(0/3) Convert Audio to WAVE Format

MP3

ffmpeg -i foo.mp3 foo.wav

MKV
mkvextract tracks foo.mkv 0:foo.ogg
opusdec foo.ogg foo.wav

(1/3) Convert Audio to 16kHz mono

sox foo.wav -r 16000 -c 1 -b 16 foo_16m.wav

(2/3) Split Audio into Segments

This tool will use silence detection to find good cut points. You may want to adjust its settings to achieve a good balance between short segments and few words split in half.

./abook-segment.py foo_16m.wav

settings:

[guenter@dagobert speech]$ ./abook-segment.py -h
Usage: abook-segment.py [options] foo.wav

Options:
  -h, --help            show this help message and exit
  -s SILENCE_LEVEL, --silence-level=SILENCE_LEVEL
                        silence level (default: 2048 / 65536)
  -l MIN_SIL_LENGTH, --min-sil-length=MIN_SIL_LENGTH
                        minimum silence length (default:  0.07s)
  -m MIN_UTT_LENGTH, --min-utt-length=MIN_UTT_LENGTH
                        minimum utterance length (default:  2.00s)
  -M MAX_UTT_LENGTH, --max-utt-length=MAX_UTT_LENGTH
                        maximum utterance length (default:  9.00s)
  -o OUTDIRFN, --out-dir=OUTDIRFN
                        output directory (default: abook/segments)
  -v, --verbose         enable debug output

by default, the resulting segments will end up in abook/segments

(3/3) Transcribe Audio

The transcription tool supports up to two speakers which you can specify on the command line. The resulting voxforge-packages will end up in abook/out by default.

./abook-transcribe.py -s speaker1 -S speaker2 abook/segments/

Audiobook Segmentation and Transcription (kaldi)

Some notes on how to semi-automatically segment and transcribe audiobooks or other audio sources (e.g. from librivox) using kaldi:

Directory Layout

Our scripts rely on a fixed directory layout. As segmentation of librivox recordings is one of the main applications of these scripts, their terminology of books and sections is used here. For each section of a book two source files are needed: a wave file containing the audio and a text file containing the transcript.

A fixed naming scheme is used for those which is illustrated by this example:

abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.txt
abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.wav
abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-2.txt
abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-2.wav
...

The abook-librivox.py script is provided to help with retrieval of librivox recordings and setting up the directory structure. Please note that for now, the tool will not retrieve transcripts automatically but will create empty .txt files (according to the naming scheme) which you will have to fill in manually.

The tool will convert the retrieved audio to 16kHz mono wav format as required by the segmentation scripts, however. If you intend to segment material from other sources, make sure to convert it to that format. For suggestions on what tools to use for this step, please refer to the manual segmentation instructions in the previous section.

NOTE: As the kaldi process is parallelized for mass-segmentation, at least 4 audio and prompt files are needed for the process to work.
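
Since the .txt transcripts have to be filled in by hand, it helps to run a quick sanity check before starting the pipeline. A small sketch that flags missing or still-empty transcripts following the naming scheme above (the book directory is the example path from the layout shown earlier):

# small sketch: verify that every section wave file has a non-empty transcript,
# following the abook/in/librivox naming scheme shown above
import glob
import os

book_dir = 'abook/in/librivox/11442-toten-Seelen'

for wav in sorted(glob.glob(os.path.join(book_dir, '*.wav'))):
    txt = os.path.splitext(wav)[0] + '.txt'
    if not os.path.isfile(txt):
        print('missing transcript: %s' % txt)
    elif os.path.getsize(txt) == 0:
        print('empty transcript  : %s' % txt)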

(1/4) Preprocess the Transcript

This tool will tokenize the transcript and detect OOV tokens. Those can then be either replaced or added to the dictionary:

./abook-preprocess-transcript.py abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.txt

(2/4) Model adaptation

For the automatic segmentation to work, we need a GMM model that is adapted to the current dictionary (which likely had to be expanded during transcript preprocessing) plus uses a language model that covers the prompts.

First, we create a language model tuned for our purpose:

./abook-sentences.py abook/in/librivox/11442-toten-Seelen/*.prompt
./speech_build_lm.py abook_lang_model abook abook abook parole_de

Now we can create an adapted model using this language model and our current dict:

./speech_kaldi_adapt.py data/models/kaldi-generic-de-tri2b_chain-latest dict-de.ipa data/dst/lm/abook_lang_model/lm.arpa abook-de
pushd data/dst/asr-models/kaldi/abook-de
./run-adaptation.sh
popd
./speech_dist.sh -c abook-de kaldi adapt
tar xfvJ data/dist/asr-models/kaldi-abook-de-adapt-current.tar.xz -C data/models/

(3/4) Auto-Segment using Kaldi

Next, we need to create the kaldi directory structure and files for auto-segmentation:

./abook-kaldi-segment.py data/models/kaldi-abook-de-adapt-current abook/in/librivox/11442-toten-Seelen

now we can run the segmentation:

pushd data/dst/speech/asr-models/kaldi/segmentation
./run-segmentation.sh 
popd

(4/4) Retrieve Segmentation Result

Finally, we can retrieve the segmentation result in voxforge format:

./abook-kaldi-retrieve.py abook/in/librivox/11442-toten-Seelen/

Training Voices for Zamia-TTS

Zamia-TTS is an experimental project that tries to train TTS voices based on (reviewed) Zamia-Speech data. Downloads here:

https://goofy.zamia.org/zamia-speech/tts/

Tacotron 2

This section describes how to train voices for NVIDIA's Tacotron 2 implementation. The resulting voices will have a sample rate of 16kHz as that is the default sample rate used for Zamia Speech ASR model training. This means that you will have to use a 16kHz waveglow model which you can find, along with pretrained voices and sample wavs here:

https://goofy.zamia.org/zamia-speech/tts/tacotron2/

now with that out of the way, Tacotron 2 model training is pretty straightforward. The first step is to export filelists for the voice you'd like to train, e.g.:

./speech_tacotron2_export.py -l en -o ../torch/tacotron2/filelists m_ailabs_en mailabselliotmiller

next, change into your Tacotron 2 training directory

cd ../torch/tacotron2

and specify file lists, sampling rate and batch size in hparams.py:

diff --git a/hparams.py b/hparams.py
index 8886f18..75e89c9 100644
--- a/hparams.py
+++ b/hparams.py
@@ -25,15 +25,19 @@ def create_hparams(hparams_string=None, verbose=False):
         # Data Parameters             #
         ################################
         load_mel_from_disk=False,
-        training_files='filelists/ljs_audio_text_train_filelist.txt',
-        validation_files='filelists/ljs_audio_text_val_filelist.txt',
-        text_cleaners=['english_cleaners'],
+        training_files='filelists/mailabselliotmiller_train_filelist.txt',
+        validation_files='filelists/mailabselliotmiller_val_filelist.txt',
+        text_cleaners=['basic_cleaners'],
 
         ################################
         # Audio Parameters             #
         ################################
         max_wav_value=32768.0,
-        sampling_rate=22050,
+        #sampling_rate=22050,
+        sampling_rate=16000,
         filter_length=1024,
         hop_length=256,
         win_length=1024,
@@ -81,7 +85,8 @@ def create_hparams(hparams_string=None, verbose=False):
         learning_rate=1e-3,
         weight_decay=1e-6,
         grad_clip_thresh=1.0,
-        batch_size=64,
+        # batch_size=64,
+        batch_size=16,
         mask_padding=True  # set model's padded outputs to padded values
     )

and start the training:

python train.py --output_directory=elliot --log_directory=elliot/logs

Tacotron

  • (1/2) Prepare a training data set
./ztts_prepare.py -l en m_ailabs_en mailabselliotmiller elliot
  • (2/2) Run the training
./ztts_train.py -v elliot 2>&1 | tee train_elliot.log

Model Distribution

To build tarballs from models, use the speech_dist.sh script, e.g.:

./speech_dist.sh generic-en kaldi tdnn_sp

License

My own scripts as well as the data I create (i.e. lexicon and transcripts) are LGPLv3 licensed unless otherwise noted in the scripts' copyright headers.

Some scripts and files are based on works of others; in those cases it is my intention to keep the original license intact. Please make sure to check the copyright headers inside for more information.

Contributors

a-rose, alanderex, dpny518, gooofy, mpuels, pguyot, svenha, the01


Issues

Installing py-nltools and py-kaldi-asr does not find atlas

Thanks for this very powerful project!
I am trying to install py-nltools which in turn wants to install py-kaldi-asr:
...
Processing py-kaldi-asr-0.2.4.tar.gz
Writing /tmp/easy_install-TO1N7N/py-kaldi-asr-0.2.4/setup.cfg
Running py-kaldi-asr-0.2.4/setup.py -q bdist_egg --dist-dir /tmp/easy_install-TO1N7N/py-kaldi-asr-0.2.4/egg-dist-tmp-e5cWqN
looking for atlas library, trying pkg-config first...
Package atlas was not found in the pkg-config search path.
Perhaps you should add the directory containing `atlas.pc'
to the PKG_CONFIG_PATH environment variable
No package 'atlas' found

I have a correct kaldi installation (under Ubuntu 17.10), but there is no atlas.pc, only
/usr/lib/x86_64-linux-gnu/pkgconfig/blas-atlas.pc
/usr/lib/x86_64-linux-gnu/pkgconfig/lapack-atlas.pc

Any suggestions?

How to create 'data/dst/speech/%s/ai-sentences.txt'?

Hi Guenter,

I'm trying to get your scripts running and have a question regarding the training of the German language model. I've run

speech$ ./speech_sentences_de.py --train-punkt

which writes speech/data/dst/speech/de/punkt.pickle. And

speech$ ./speech_sentences_de.py

writes speech/data/dst/speech/de/sentences.txt. So far so good. Now I'd like to run speech_build_lm.py, but according to the lines

SOURCES = ['data/dst/speech/%s/sentences.txt',
           'data/dst/speech/%s/ai-sentences.txt']

it also needs the file 'data/dst/speech/%s/ai-sentences.txt'. The command

speech$ grep -rF ai-sentences.txt .

yielded

speech_build_lm.py:           'data/dst/speech/%s/ai-sentences.txt']

So the question is: What does ai-sentences.txt contain and how do I create it? To train the language model ai-sentences.txt is not necessary, because we have sentences.txt. But I'd like to know where ai-sentences.txt comes from 😄

Thanks for your help in advance!

Cheers,
Marc

data/src/speech/kaldi-run-nnet3.sh

Hello, I am training the kaldi-tuda dataset using kaldi-run-nnet3.sh but I have problems when running train_tdnn.sh. It seems that script is deprecated. What version of kaldi are you using to train the models you have made available on voxforge?
Since it is deprecated, they point to train_dnn.py but then I run into other problems.

When running train_tdnn.sh I am getting:
steps/nnet3/train_tdnn.sh: THIS SCRIPT IS DEPRECATED steps/nnet3/train_tdnn.sh --stage -10 --num-epochs 8 --num-jobs-initial 2 --num-jobs-final 14 --splice-indexes -4,-3,-2,-1,0,1,2,3,4 0 -2,2 0 -4,4 0 --feat-type raw --online-ivector-dir exp/nnet3/ivectors_train --cmvn-opts --norm-means=false --norm-vars=false --initial-effective-lrate 0.005 --final-effective-lrate 0.0005 --cmd run.pl --pnorm-input-dim 2000 --pnorm-output-dim 250 --minibatch-size 128 data/train data/lang exp/tri1_ali exp/nnet3/nnet_tdnn_a feat-to-dim scp:exp/nnet3/ivectors_train/ivector_online.scp - steps/nnet3/train_tdnn.sh: creating neural net configs steps/nnet3/make_tdnn_configs.py --splice-indexes -4,-3,-2,-1,0,1,2,3,4 0 -2,2 0 -4,4 0 --feat-dim 13 --ivector-dim 100 --pnorm-input-dim 2000 --pnorm-output-dim 250 --use-presoftmax-prior-scale true --num-targets 1925 exp/nnet3/nnet_tdnn_a/configs steps/nnet3/train_tdnn.sh: calling get_egs.sh steps/nnet3/get_egs.sh --cmvn-opts --norm-means=false --norm-vars=false --feat-type raw --online-ivector-dir exp/nnet3/ivectors_train --transform-dir exp/tri1_ali --left-context 10 --right-context 10 --samples-per-iter 400000 --stage 0 --cmd run.pl --frames-per-eg 8 data/train exp/tri1_ali exp/nnet3/nnet_tdnn_a/egs steps/nnet3/get_egs.sh: invalid option --feat-type

Can you give any suggestions?

Replace arg --lang with --lang-model and --audio-corpus

Introduction

Currently, most scripts offer the argument --lang to choose between de and en. The choice de means that a language model is trained on the text corpora

  • Europarl
  • Parole

and that an acoustic model is trained on

  • a corrected version of the VoxForge corpus
  • a corrected version of the TU-Darmstadt corpus (gspv2)

There is no way to choose for example just one text corpus and one audio corpus. Hereby I propose to change the command line arguments of some scripts to make it simpler to pick text and audio corpora to train an ASR system.

Current workflow

An example workflow to train a German speech recognition system might look like

$ ./speech_sentences_de.py
$ ./speech_build_lm.py --lang de
$ ./speech_kaldi_export.py --lang de
$ cd data/dst/speech/de/kaldi
$ ./build-lm.sh
$ ./run-chain.sh

Under the hood speech_sentences_de.py extracts sentences from two text corpora (Europarl and Parole) and writes them to a text file containing one sentence per line. Both of the corpora have to be parsed in a unique way to extract sentences from them. Currently, there is no way to build a language model just on exactly one corpus (except by altering the script of course).

The command speech_build_lm.py --lang de concatenates text files containing one sentence per line and builds a language model (3-gram) using the program ngram.

The command speech_kaldi_export.py --lang de consumes all transcripts for the VoxForge and TU-Darmstadt corpora in data/src/speech/de/transcripts_*.csv, the pronunciation dictionary data/src/speech/de/dict.ipa and the language model created in the previous step and "deploys" them to speech/data/dst/speech/de/kaldi in a way that adheres to the Kaldi interface (regarding directory and file structure). Currently there is no way to conveniently choose exactly one, two, or more audio corpora. But it would be nice to be able to choose a small audio corpus to do regression testing.

The script build-lm.sh converts the language model from the ARPA format to the finite state transducer G.fst - as Kaldi expects it.

Finally, run-chain.sh uses Kaldi to train the acoustic model.

Here is an example sequence of commands to train a complete English ASR system with Kaldi:

$ ./speech_sentences_en.py
$ ./speech_build_lm.py --lang en
$ ./speech_kaldi_export.py --lang en
$ cd data/dst/speech/en/kaldi
$ ./build-lm.sh
$ ./run-chain.sh

Proposed workflow

This is an example of the proposed workflow for creating an ASR system:

$ ./speech_sentences.py europarl-de
$ ./speech_sentences.py parole
$ ./speech_sentences.py voxforge-de-prompts
$ ./speech_build_lm.py europarl-de \
                       parole \
                       voxforge-de-prompts \
                       lm-europarl-de-parole-voxforge-de-prompts
$ ./speech_kaldi_export.py --audio-corpus voxforge \
                           --audio-corpus tu-darmstadt \
                           --language-model lm-europarl-de-parole-voxforge-de-prompts \
                           --dictionary dict-de.ipa \
                           --model-name experiment1
$ cd data/dst/speech/kaldi/experiment1
$ ./build-lm.sh
$ ./run-chain.sh

In the example we create an ASR system where the language model is based on the Europarl and Parole corpora. To train the acoustic model, the VoxForge and TU-Darmstadt corpora are used.

The command ./speech_sentences.py TEXTCORPUS writes the extracted sentences by convention to data/dst/text-corpora/TEXTCORPUS.txt.

The command speech_build_lm.py TEXTCORPUS [TEXTCORPUS ...] LMNAME expects as arguments a list of text corpora and a name for the resulting language model (LMNAME). The files of the language model are written to the directory data/dst/lm/LMNAME/.

The command speech_kaldi_export.py expects one or more --audio-corpus, exactly one --language-model, and exactly one --model-name MODELNAME. The script creates a directory data/dst/asr-models/kaldi/MODELNAME and places all files required by Kaldi in it.

@gooofy What do you think about my suggested changes?

UPDATES 2018-04-03

  • Augmented example workflow to show that speech_sentences.py can also extract sentences from VoxForge's prompts.
  • In example for proposed workflow: add --dictionary to speech_kaldi_export.py

Mac OS X?

How do I set up the installation on Mac OS X?
Thank you

Experiences from the new modular setup

Hi.

I followed the new modular approach from README.md (for German). It worked smoothly! I have some minor edits that I can put into a pull request.

In contrast to my previous zamia-speech experiments, I did not add the sentences from the speech corpora to the language model (i.e. I only processed europarl_de and parole_de with speech_sentences.py) because I think that this is a fairer evaluation. As expected, the WER jumped up. Does the following make sense or has anybody achieved lower WERs in a comparable setup?
%WER 11.57 [ 20312 / 175507, 3225 ins, 2930 del, 14157 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_9_1.0
%WER 13.11 [ 23010 / 175507, 3531 ins, 3211 del, 16268 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_1.0
I configured the ASR with:
./speech_kaldi_export.py generic-de-small dict-de.ipa generic_de_lang_model voxforge_de gspv2 forschergeist zamia_de zamia_de_noisy

German STT not working

Hello,

I am trying to test the pre-trained Kaldi models with the German language. It shows errors:

debian-admin@debian:~$ python kaldi_decode_wav.py -v -m /opt/kaldi/model/kaldi-generic-de-tdnn_sp Wie_geht_es_dir.wav
DEBUG:root:/opt/kaldi/model/kaldi-generic-de-tdnn_sp loading model...
DEBUG:root:/opt/kaldi/model/kaldi-generic-de-tdnn_sp loading model... done, took 52.890310s.
DEBUG:root:/opt/kaldi/model/kaldi-generic-de-tdnn_sp creating decoder...
DEBUG:root:/opt/kaldi/model/kaldi-generic-de-tdnn_sp creating decoder... done, took 0.695156s.
Traceback (most recent call last):
File "kaldi_decode_wav.py", line 72, in
if decoder.decode_wav_file(wavfile):
File "kaldiasr/nnet3.pyx", line 195, in kaldiasr.nnet3.KaldiNNet3OnlineDecoder.decode_wav_file (kaldiasr/nnet3.cpp:4362)
AssertionError

I had installed with this guide https://github.com/gooofy/zamia-speech#debian-9-stretch-amd64

Can you tell me why it is not working? :( Thank you.

Hardware requirements to train Kaldi models

Dear Guenter,
It would be very helpful to know the hardware requirements, to avoid problems with lack of RAM, disk or GPU memory and to avoid wasting time trying to train on an insufficiently powerful computer.

I understand that there may be no well-defined requirements; everything depends on the corpora used, configs, etc.

But would you mind at least sharing your hardware spec, so we know what was enough to build the kaldi-generic-en-tdnn_f model, and how much time it took?
Thank you!

Excluding multilingual parole_de files

The file train_all.txt generated during the construction of the German language model contains many English sentences. Some boilerplate sentences can be removed from data/dst/text-corpora/ with a tiny grep call, but I would like to exclude whole texts from parole_de because they are multilingual, e.g. geheimdienst.sgm. My current solution is to rename such files so that their suffix is no longer .sgm (that's what the script is looking for). Is there a cleaner approach, like an exclude file?

Missing creation of tmp

*_to_vf.sh scripts try to write a script called tmp/run_parallel.sh, yet tmp is not created beforehand, yielding a crash.

Unable to do Kaldi online decoding with kaldi-generic-en-tdnn_sp-r20190227

I downloaded kaldi-generic-en-tdnn_sp-r20190227 and tried to use it with the kaldi online decoders.
I used the following arguments:

./online2-tcp-nnet3-decode-faster --feature-type=mfcc --min-active=200 --max-active=8000 \
--beam=13.0 --lattice-beam=6.0 --acoustic-scale=1.0 --frames-per-chunk=51 \
--frame-subsampling-factor=3 --endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10:11:12:13:14:15 --ivector-silence-weighting.silence-weight=1e-3 --ivector-silence-weighting.silence-phones=1:2:3:4:5:6:7:8:9:10:11:12:13:14:15 \
--ivector-extraction-config=/home/models/kaldi-generic-en-tdnn_sp-r20190227/ivectors_test_hires/conf/ivector_extractor.conf --samp-freq=16000 --mfcc-config=/home/models/kaldi-generic-en-tdnn_sp-r20190227/conf/mfcc.conf \
/home/models/kaldi-generic-en-tdnn_sp-r20190227/model/final.mdl /home/models/kaldi-generic-en-tdnn_sp-r20190227/model/graph/HCLG.fst /home/models/kaldi-generic-en-tdnn_sp-r20190227/model/graph/words.txt

And I keep getting the following error:
ERROR (online2-tcp-nnet3-decode-faster[5.5.313~1-203c]:OnlineTransform():online-feature.cc:521) Dimension mismatch: source features have dimension 91 and LDA #cols is 280
The same arguments work fine with all the other models I use (not from Zamia). Do you know why that happens?
do you have the online.conf that you use with kaldi?

run time error

I tried to run the pre-trained model
./kaldi-generic-en-tdnn_fl-r20190609/

but I am getting a runtime error.

Model setting

Hi Guenter,
I found a project and want to use the German model from it. But it looks like the model has a somewhat different structure (it was configured to work with the Kaldi GStreamer Server software). Can you give any instructions or tips on how to set up this model to work with zamia-speech? I'm trying to use it in the Live Mic Demo.

Factorized TDNN models not working

OS: Debian 9 Stretch 64bit
Expected behavior: Factorized ASR models work the same way as the previous models
Observed: Previous models work fine, factorized ones (both de and en) create a RuntimeError

I've been trying to use the factorized TDNN models but getting a RuntimeError in the wave decoder test.
Here is the command I've used:

python kaldi_decode_wav.py -v -m /opt/kaldi/model/kaldi-generic-de-tdnn_f test.wav 

and here is the error message:

Traceback (most recent call last): File "kaldi_decode_wav.py", line 60, in <module> kaldi_model = KaldiNNet3OnlineModel (options.modeldir, acoustic_scale=1.0, beam=7.0, frame_subsampling_factor=3) File "kaldiasr/nnet3.pyx", line 134, in kaldiasr.nnet3.KaldiNNet3OnlineModel.__cinit__ (kaldiasr/nnet3.cpp:3549) RuntimeError

What I've tried so far is to update the Debian packages (python-kaldiasr, python-nltools; no updates were available) and kaldi_decode_wav.py.

Live Mic Demo stops decoding after few words

Hello,
I have just discovered zamia-speech and am trying to get the live mic demo running. Everything works just fine; however, the decoding suddenly stops after a few words (usually something around 10). Do you have any suggestions why that could be?

transcripts/wav data source

Hi,

quick question about your data sources: after pulling voxforge data and downloading TUDA audio files (voxforge forums), I still run into problems compiling the kaldi models because of missing wav files.

I checked the transcripts.csv file and it contains roughly 130000 utterances. Checking file names against wav files that are available after pulling all the data it turns out there are around 88000 missing wavs.

So my question is: Is there any chance that there's another source of utterance recordings that's not pulled automatically?

Thanks in advance.

Sequitur model umlaut

The sequitur model does not seem able to deal with words containing umlauts (e.g. "fünfzig") when I call sequitur_gen_ipa() directly and I couldn't determine how zamia-speech deals with this.
Is there a way to automatically generate phonemes for words with umlauts?

Missing "tree" while modifying the language model with the existing voxforge model

Hi,
Watching your discussion on "Optimizing nnet3/chain models for speed and memory consumption" really helped me a lot. Because my decoding is just limited in a small situation, I tried to built a small HCLG.fst so that the decoding will be faster. But as I followed the steps in Online decoding in Kaldi-Example for using your own language model with existing online-nnet2 models , I found the already-built model missing the "tree" file. Could you please share it which is expected in the "tdnn_250" and "tdnn_sp"?

Thanks in advance. ^_^

Way to add new word

Hi,

I would like to add a new word, but this word doesn't exist in the CMU pronouncing dictionary. So I used the Logios lexicon tool to generate a new pronunciation dictionary. I would like to know how to add this new dictionary to the existing model. Thank you so much.

how to create decode data from socket recording audio ?

Hello,
I made a Kaldi live decoding server on my PC and
an Android application that sends my voice to the server over a socket.
I'd like to feed the decoder from the socket stream instead of PulseAudio on the PC.
Please help me.

My code is like below; it did not work:
--> samples=record(sc)
--> audio, finalize = vad.process_audio(samples)

thank you!


    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((HOST, PORT))
    s.listen(10)
    while True:
        print "Listening at:", s.getsockname()
        sc, addr = s.accept()
        print "Connection established with:", addr

        while True:
            samples = record(sc)
            audio, finalize = vad.process_audio(samples)
            if not audio:
                continue
            user_utt, confidence = asr.decode(audio, finalize, stream_id=STREAM_ID)

            print "\r%s                     " % user_utt,

            if finalize:
                print



Runtime error on Demo program

I have been following the Github Quickstart link which converts 4 demo wav files to text.
It works fine, but now when I use my own wav file it throws an error as below:

Traceback (most recent call last): File "kaldi_decode_wav.py", line 72, in <module> if decoder.decode_wav_file(wavfile): File "kaldiasr/nnet3.pyx", line 207, in kaldiasr.nnet3.KaldiNNet3OnlineDecoder.decode_wav_file (kaldiasr/nnet3.cpp:4726) File "kaldiasr/nnet3.pyx", line 170, in kaldiasr.nnet3.KaldiNNet3OnlineDecoder.decode (kaldiasr/nnet3.cpp:3968) RuntimeError`

The file I am using is a vimeo video converted to wav using youtube-dl.
I get the wav file using this command:

    youtube-dl --extract-audio --audio-format wav https://vimeo.com/73643788

I give this file as input to kaldi_decode_wav.py.

Can anyone tell me what I am doing wrong?
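
As a first diagnostic step, it may be worth checking whether the extracted wav matches the audio format the models are trained on. The sketch below assumes 16 kHz (matching the data/speech/.../16kHz corpus layout), 16 bit, mono audio; a youtube-dl extraction typically yields 44.1 kHz stereo, which would need to be converted first (e.g. with sox or ffmpeg).

    # Hedged sketch: inspect the wav header before decoding. The expected
    # values (16 kHz, 16 bit, mono) are assumptions about the training data.
    import sys
    import wave

    w = wave.open(sys.argv[1], "rb")
    print "channels     :", w.getnchannels()   # expected: 1
    print "sample rate  :", w.getframerate()   # expected: 16000
    print "sample width :", w.getsampwidth()   # expected: 2 (16 bit)
    w.close()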

android

Have you tried running this on the Android side?

Adaptation script fails at multiple stages

I have been trying to run the adaptation script. I followed the instructions, but the script fails at multiple stages. I tried to fix the problems but gave up after dozens of errors. Has anybody run it successfully?

Can you run this on Windows?

Hey, I noticed that you don't have instructions for running this on Windows. Is it possible?

Many thanks,
Kiran

TDNN model performance

I have a question regarding the performance (in terms of accuracy) of the Kaldi models (specifically kaldi-generic-en-tdnn_f) versus other well-known engines from Google and Watson when doing speech-to-text. In my preliminary testing (on telephony data), the Kaldi models are not nearly as accurate as those from Google and Watson. My typical audio file is ~30-60 seconds long.

Since I'm not sure how this model was trained, should I chunk the audio file into multiple sentence-length files to improve performance? Has anyone had good luck with Kaldi in terms of performance compared to Google and Watson?
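
For anyone who wants to try sentence-length chunking, here is a minimal sketch using pydub's silence-based splitting (an extra dependency, not part of zamia-speech); the silence threshold and minimum silence length are assumptions that will need tuning for telephony audio.

    # Hedged sketch: split a long recording into chunks at silences and
    # export each chunk as its own wav file for separate decoding.
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_wav("call.wav")
    chunks = split_on_silence(audio,
                              min_silence_len=500,   # ms of silence to split on
                              silence_thresh=-40)    # dBFS regarded as silence

    for i, chunk in enumerate(chunks):
        chunk.export("chunk_%03d.wav" % i, format="wav")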

Missing utils/prepare_lang.sh

Hi,

I am trying to adapt a model but am running into some trouble when invoking the "run-adaptation.sh" script. It fails to call the script "utils/prepare_lang.sh" and others like "utils/mkgraph.sh" as well.
After some investigation I assume this is what is missing:
https://github.com/kaldi-asr/kaldi/tree/master/egs/wsj/s5/utils

I made a simple Raspberry Pi setup according to the instructions in the readme. I get the /opt/kaldi folder, but some directories like "egs" are missing there.

Is this a bug, should it work out of the box, or are there other things that should be installed/configured first?
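
If the WSJ recipe helpers really are the only thing missing, one possible workaround (a sketch, assuming a full Kaldi source checkout is available somewhere) is to link Kaldi's shared utils/ and steps/ directories next to the adaptation scripts, the way standard Kaldi egs recipes do:

    # Hedged sketch: symlink Kaldi's shared recipe helpers into the current
    # working directory. The checkout path is an assumption; the packaged
    # /opt/kaldi apparently does not ship the egs/ tree.
    import os

    KALDI_ROOT = "/path/to/kaldi"   # full source checkout, not /opt/kaldi

    for d in ("utils", "steps"):
        src = os.path.join(KALDI_ROOT, "egs", "wsj", "s5", d)
        if not os.path.exists(d):
            os.symlink(src, d)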

nspc

In the lexicon of the downloaded ASR models, the phonemes for nspc are:

    nspc nC

i.e. the phone nC. However, nC does not exist anywhere else. Shouldn't this be SPN? What does nspc represent in the standard Kaldi recipes?

How much GPU RAM is needed for training?

I retrained a German model, but it runs out of GPU RAM in:
    nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/nnet3_chain/tdnn_250/cache.1 --write-cache=exp/nnet3_chain/tdnn_250/cache.2 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=1 "nnet3-am-copy --raw=true --learning-rate=0.000998730064726 --scale=0.980025398705 exp/nnet3_chain/tdnn_250/1.mdl - |" exp/nnet3_chain/tdnn_250/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=2 ark:exp/nnet3_chain/tdnn_250/egs/cegs.2.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=1 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=512 ark:- ark:- |" exp/nnet3_chain/tdnn_250/2.1.raw

I have 4 GB of GPU RAM, which is used only by Kaldi in exclusive mode. This was enough back in June. What are your experiences or recommendations?

The error message contains this:
    ERROR (nnet3-chain-train[5.5.95~1-4bdb]:AllocateNewRegion():cu-allocator.cc:513) Failed to allocate a memory region of 356515840 bytes. Possibly smaller minibatch size would help. Memory info: free:190M, used:3844M, total:4035M, free/total:0.0473123

So, should I retry with --minibatch-size=384 (instead of 512)? That value lets this step complete, but I would probably have to rerun all steps to be sure.

Raspbian Import Error - libkaldi-decoder.so

Hi, I tried to make a new, clean setup on a Raspberry Pi just like I did 3 weeks ago (simply with the apt-get command). But now I suddenly get this error message:

    pi@raspberrypi:~ $ python kaldi_decode_wav.py -v demo?.wav
    Traceback (most recent call last):
      File "kaldi_decode_wav.py", line 36, in <module>
        from kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder
    ImportError: libkaldi-decoder.so: cannot open shared object file: No such file or directory

Could one of the Raspbian packages have been broken by one of the latest commits?

KALDI - nnet3-latgen-faster dimension mismatch 40 vs 13

Hi,

I'm computing the features for the kaldi-generic-en-tdnn_f-r20190227 model with:

    compute-mfcc-feats \
        --config=kaldi-generic-en-tdnn_f-r20190227/conf/mfcc.conf \
        scp:transcripts/wav.scp \
        ark,scp:transcripts/feats.ark,transcripts/feats.scp

and then running:

    nnet3-latgen-faster \
        --word-symbol-table=kaldi-generic-en-tdnn_f-r20190227/model/graph/words.txt \
        kaldi-generic-en-tdnn_f-r20190227/model/final.mdl \
        kaldi-generic-en-tdnn_f-r20190227/model/graph/HCLG.fst \
        ark:transcripts/feats.ark \
        ark,t:transcripts/lattices.ark

I get the following error:

    ERROR (nnet3-latgen-faster[5.5.337~1-35f96]:EnsureFrameIsComputed():nnet-am-decodable-simple.cc:101) Neural net expects 'input' features with dimension 40 but you provided 13
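
Since the network reports a 40-dimensional 'input' layer, the features have to be recomputed with a 40-dimensional (high-resolution) MFCC configuration rather than the 13-dimensional conf/mfcc.conf. A minimal sketch, assuming the model archive ships a suitable hires config (the file name conf/mfcc_hires.conf is an assumption; check the extracted tarball):

    # Hedged sketch: recompute the features with a 40-dim MFCC config so
    # they match the network's input dimension. The config file name is an
    # assumption; adjust it to whatever the model archive actually contains.
    import subprocess

    subprocess.check_call([
        "compute-mfcc-feats",
        "--config=kaldi-generic-en-tdnn_f-r20190227/conf/mfcc_hires.conf",
        "scp:transcripts/wav.scp",
        "ark,scp:transcripts/feats.ark,transcripts/feats.scp",
    ])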

DNN misalignment

I have been trying to do forced alignment with steps/nnet3/align.sh, but the alignment is really off.

Are there any comments about this? I did the same steps with the other models in Kaldi and they are all fine.

Wiktionary results for de/dict.ipa

The new extraction of pronunciation information from Wiktionary delivers an impressive lexicon!

Two minor remarks:

  1. Some forms like "staatsflagge" and "staatskonzern" are missing, although the plural forms are extracted correctly.

  2. Some Wiktionary entries seem to be better than the old entries in dict.ipa, e.g. "staatseigentum" and "staatseigentums" in dict.ipa currently have different IPA signs for the last "e" (and "u"), while the Wiktionary versions seem to be correct.

./run-adaptation.sh : Lots of files missing errors during the process

During the ./run-adaptation.sh process, the following errors about missing files are raised. I have been manually copying these files over from other models (but I assume that is wrong). Why are these files missing, and is there a way to fix this?

data/lang.adapt_test/phones/silence.csl
data/lang.adapt_test/phones/disambig.in
data/lang.adapt_test/L_disambig.fst

'noisy...' files break run-chain.sh

I am sorry for opening another issue, but run-chain.sh complains about missing noisy files like data/speech/de/16kHz/noisyManu-20140324-m19_deM19-32.wav and stops. Should I avoid the noisy files for now, or should I try to generate them? In either case, I would need a small hint on how to proceed.
Thanks.

w2l_env_activate, w2l_train default values in speechrc

What are the default values for w2l_env_activate and w2l_train? They are missing from my .speechrc file.

    ./wav2letter_export.py -l en -v generic-en dict-en.ipa generic_en_lang_model_medium voxforge_en cv_corpus_v1 librispeech ljspeech m_ailabs_en tedlium3 zamia_en voxforge_en_noisy librispeech_noisy cv_corpus_v1_noisy cv_corpus_v1_phone zamia_en_noisy voxforge_en_phone librispeech_phone zamia_en_phone

    Traceback (most recent call last):
      File "./wav2letter_export.py", line 98, in
        w2l_env_activate = config.get("speech", "w2l_env_activate")
      File "/usr/lib/python2.7/ConfigParser.py", line 618, in get
        raise NoOptionError(option, section)
    ConfigParser.NoOptionError: No option 'w2l_env_activate' in section: 'speech'
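
Judging from the traceback, these options are read from the [speech] section of the .speechrc configuration file and have no built-in defaults, so they have to be set by hand. A sketch of what the entries might look like; both paths are pure assumptions and need to point at your own wav2letter++ environment and training binary:

    [speech]
    # ... existing options ...
    # both paths are assumptions - adjust to your wav2letter++ installation
    w2l_env_activate = /home/user/wav2letter-env/bin/activate
    w2l_train        = /home/user/wav2letter/build/Train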
