Character-level speech recognizer using CTC loss with deep RNNs in TensorFlow.
### About
This is an ongoing project, working towards an implementation of the character-level incremental speech recognizer (ISR) detailed in the paper by Kyuyeon Hwang and Wonyong Sung. It works at the character level, using one deep RNN trained with CTC loss as the acoustic model and another deep RNN trained as a character-level language model. The acoustic model reads in log mel frequency filterbank feature vectors (40-dim inputs).
The audio signal processing is done using librosa.
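For reference, extracting such features with librosa takes only a few lines; the sample rate, window and hop sizes below are assumptions for illustration, not necessarily what this repo uses:

```python
import librosa
import numpy as np

# Load the audio; 16 kHz mono is assumed here.
signal, sr = librosa.load("path_to_file.wav", sr=16000)

# 40-band mel filterbank, 25 ms windows with a 10 ms hop at 16 kHz.
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=40,
                                     n_fft=400, hop_length=160)
features = np.log(mel + 1e-8).T  # one 40-dim log-energy vector per frame
```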
Currently only the acoustic model has been completed, and it still lacks a well-trained example. One pre-trained example is available here and can be tried on any file (your own recorded voice, for example).
The character-level language model is still in the works.
### Data
The datasets currently supported are:
- LibriSpeech by Vassil Panayotov
- Shtooka
- Vystadial 2013
- TED-LIUM
The data is fed through two pipelines, one for testing, and the other for training.
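For intuition, the split might look something like the sketch below; this is purely illustrative, and the repo's actual pipeline code is more involved:

```python
import random

def make_pipelines(samples, test_fraction=0.1, seed=42):
    """Split (features, transcript) pairs into training and testing pipelines."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    split = int(len(samples) * test_fraction)
    test_set, train_set = samples[:split], samples[split:]

    def batches(dataset, batch_size=32):
        # Yield fixed-size batches; the last, smaller batch is kept too.
        for i in range(0, len(dataset), batch_size):
            yield dataset[i:i + batch_size]

    return batches(train_set), batches(test_set)
```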
### How to Run
#### Install dependencies
##### Required
- TensorFlow (>= 0.12RC1)
- librosa
Install TensorFlow by following the website documentation. GPU support is not mandatory but strongly recommended if you intend to train the RNN.
Install the other required dependencies by running:

```
pip3 install -r requirements.txt
```
##### Optional
- sox (for live transcript only), install with `sudo apt-get install sox` or `brew install sox --with-flac`
- libcupti (for timeline only), install with `sudo apt-get install libcupti-dev`
- pyaudio (for live transcript only), install with `sudo apt-get install python3-pyaudio`
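For the curious, live transcription with pyaudio boils down to reading microphone frames in a loop and feeding them to the model; a minimal capture sketch follows, where the chunk size and sample rate are assumptions:

```python
import pyaudio

RATE = 16000   # sample rate assumed here; match your model's features
CHUNK = 1024   # frames read per call

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

frames = []
for _ in range(int(RATE / CHUNK * 3)):  # capture ~3 seconds of audio
    frames.append(stream.read(CHUNK))

stream.stop_stream()
stream.close()
p.terminate()
audio_bytes = b"".join(frames)  # raw 16-bit PCM, ready for feature extraction
```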
#### Run data preparation script
I've prepared a bash script to download LibriSpeech (~700 MB) and extract the data to the right place:

```
$ chmod +x prepare_data.sh
$ ./prepare_data.sh
```

It will remove the tar files after downloading and extracting them.
All hyperparameters for the network are defined in `config.ini`. A different config file can be fed to the training program like so:

```
$ python stt.py --config_file="different_config_file.ini"
```

You should ensure it follows the same format as the file provided.
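For illustration only, a config file in this format might look like the snippet below; the section and key names are invented for the example, so check the provided `config.ini` for the actual schema:

```ini
[nn]
num_layers = 3
hidden_size = 512

[training]
batch_size = 32
learning_rate = 0.001
max_epochs = 50
```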
#### Running Optimizer
Once your dependencies are set up and the data has been downloaded and extracted into the appropriate location, the optimizer can be started by running:

```
$ python stt.py --train
```

Dynamic RNNs are used because memory consumption on the fully unrolled network was massive and the model took 30 minutes to build. Unfortunately this comes at a cost in speed, but I think the tradeoff is worth it in this case, as the model can now fit on a single GPU.
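To make the tradeoff concrete, here is a rough sketch of a dynamic-RNN acoustic model trained with CTC loss, written against the TF 1.x API; the layer count, cell size and variable names are illustrative, not this repo's actual graph:

```python
import tensorflow as tf

num_features, num_classes = 40, 29  # 40-dim filterbanks; chars + blank (illustrative)

inputs = tf.placeholder(tf.float32, [None, None, num_features])  # [batch, time, 40]
seq_len = tf.placeholder(tf.int32, [None])                       # frames per utterance
labels = tf.sparse_placeholder(tf.int32)                         # character targets

# Stacked LSTM unrolled dynamically, so the graph stays small and builds fast.
cells = [tf.contrib.rnn.LSTMCell(512) for _ in range(3)]
outputs, _ = tf.nn.dynamic_rnn(tf.contrib.rnn.MultiRNNCell(cells), inputs,
                               sequence_length=seq_len, dtype=tf.float32)

logits = tf.layers.dense(outputs, num_classes)   # per-frame character scores
logits = tf.transpose(logits, [1, 0, 2])         # ctc_loss expects time-major
loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```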
#### Running the network
You can also use a trained network to process a WAV file:

```
$ python stt.py --file "path_to_file.wav"
```

The result will be printed on standard output. At this time only the acoustic model does the processing, so the result can look odd.
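Under the hood, turning the acoustic model's output into text amounts to running a CTC decoder over the per-frame logits. A greedy-decoding sketch with the TF 1.x API (the class count and variable names are illustrative):

```python
import tensorflow as tf

# Time-major logits from the acoustic model: [max_time, batch, num_classes].
logits = tf.placeholder(tf.float32, [None, None, 29])
seq_len = tf.placeholder(tf.int32, [None])  # frames per utterance

decoded, neg_log_prob = tf.nn.ctc_greedy_decoder(logits, seq_len)
dense = tf.sparse_tensor_to_dense(decoded[0], default_value=-1)
# Each row of `dense` holds character indices; map them back to characters
# with the same alphabet the model was trained on.
```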
#### Analysing performance
You can add the `--timeline` option to produce a timeline file and see how everything is going. The resulting file is overwritten at each step. It can be opened with Chrome by navigating to chrome://tracing/ and loading the file.
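This relies on TensorFlow's tracing support; roughly, producing such a Chrome trace looks like the sketch below (the traced op and the output filename are illustrative):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Any op works for demonstration; in stt.py this would be the training step.
op = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(op, options=run_options, run_metadata=run_metadata)
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as f:  # load this via chrome://tracing/
        f.write(trace.generate_chrome_trace_format())
```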
### Project Road Map
With verification and testing performed at every step:
- Build character-level RNN code
- Add CTC beam search
- Wrap acoustic model and language model into general 'Speech Recognizer'
- Add ability for human to sample and test
Ultimately, I'd like to work towards bridging this with my other project, neural-chatbot, to make an open-source natural conversational engine.
### License
MIT
### References
- "LibriSpeech: an ASR corpus based on public domain audio books", Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015.
- Shtooka: http://shtooka.net
- Korvas, Matěj; Plátek, Ondřej; Dušek, Ondřej; Žilka, Lukáš and Jurčíček, Filip, 2014, Vystadial 2013 – Czech data, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague, http://hdl.handle.net/11858/00-097C-0000-0023-4670-6.
- A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks", in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), May 2014.