ctc-asr's Introduction

End-to-End Speech Recognition System Using Connectionist Temporal Classification

An automatic speech recognition (ASR) system implementation that uses the connectionist temporal classification (CTC) cost function. It is inspired by Baidu's Deep Speech: Scaling up end-to-end speech recognition and Deep Speech 2: End-to-End Speech Recognition in English and Mandarin papers. The system is trained on a combined corpus containing over 900 hours of speech and achieves a word error rate (WER) of 12.6% on the test set, without the use of an external language model.

Deep Speech 1 and 2 network architectures

Figure: (a) shows the Deep Speech (1) model and (b) a variant of the Deep Speech 2 model architecture.

Installation

The system was tested on Arch Linux and Ubuntu 16.04, with Python 3.5+ and TensorFlow 1.12.0. It is highly recommended to use TensorFlow with GPU support for training.

Arch Linux

# Install dependencies.
sudo pacman -S sox python-tensorflow-opt-cuda tensorboard

# Install optional dependencies. LaTeX is only required to plot nice looking graphs.
sudo pacman -S texlive-most

# Clone the repository and install Python dependencies.
git clone https://github.com/mdangschat/ctc-asr.git
cd ctc-asr
git checkout <release_tag>

# Optionally set up a virtual environment, then install the requirements.
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

Ubuntu

Be aware that the requirements.txt file lists tensorflow as a dependency. If you install TensorFlow through pip, consider removing that entry and installing tensorflow-gpu instead. It can also be worthwhile to build TensorFlow from source.

# Install dependencies.
sudo apt install python3-tk sox libsox-fmt-all

# Install optional dependencies. LaTeX is only required to plot nice looking graphs.
sudo apt install texlive

# Clone the repository and install Python dependencies. Don't forget to use tensorflow-gpu.
git clone https://github.com/mdangschat/ctc-asr.git
cd ctc-asr
git checkout <release_tag>

# Optionally set up a virtual environment, then install the requirements.
python3 -m venv venv && source venv/bin/activate
pip3 install -r requirements.txt

Configuration

The network architecture and training parameters can be configured by adding the appropriate flags or by directly editing the asr/params.py configuration file. The default configuration requires quite a lot of VRAM (about 16 GB); consider reducing the number of units per layer (num_units_dense, num_units_rnn) and the number of RNN layers (num_layers_rnn).
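
For example, a run with a smaller model could look like this (a hypothetical invocation; the flag names mirror the parameters above and the values are only illustrative):

# Reduced configuration for GPUs with less VRAM.
python3 asr/train.py -- --num_units_rnn 1024 --num_units_dense 1024 --num_layers_rnn 3 --batch_size 8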

Corpus

There is a list of free speech corpora at the end of this section. However, the corpus itself is not part of this repository and has to be acquired by each user. For a quick start there is the speech-corpus-dl helper, which downloads a few free corpora, prepares the data, and creates a merged corpus.

All audio files have to be 16 kHz, mono WAV files. For my training runs, I removed examples shorter than 0.7 seconds or longer than 17.0 seconds. Additionally, TEDLIUM examples with labels of fewer than 5 words were removed.
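
A minimal shell sketch of such a duration filter, assuming the speech-corpus/corpus layout described below and using soxi from the sox package:

# List WAV files shorter than 0.7 s or longer than 17.0 s.
find speech-corpus/corpus -name '*.wav' | while read -r f; do
    duration=$(soxi -D "$f")
    awk -v d="$duration" 'BEGIN { exit !(d < 0.7 || d > 17.0) }' && echo "$f"
done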

The following tree shows a possible structure for the required directories:

./ctc-asr
├── asr
│   └── [...]
├── LICENSE
├── README.md
├── requirements.txt
├── testruns.md
./ctc-asr-checkpoints
└── 3c2r2d-rnn
    ├── [...]
./speech-corpus
├── cache
├── corpus
│   ├── cvv2
│   ├── LibriSpeech
│   ├── tatoeba_audio_eng
│   └── TEDLIUM_release2
├── corpus.json
├── dev.csv
├── test.csv
└── train.csv

Assuming that this repository is cloned into some/folder/ctc-asr, then by default the CSV files are expected to be in some/folder/speech-corpus and the audio files in some/folder/speech-corpus/corpus. TensorFlow checkpoints are written into some/folder/ctc-asr-checkpoints. Both folders (ctc-asr-checkpoints and speech-corpus) must exist; their paths can be changed in the asr/params.py file.
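
Assuming the default paths, the two sibling directories can be created like this:

# Create the expected checkpoint and corpus directories next to the repository.
mkdir -p some/folder/ctc-asr-checkpoints some/folder/speech-corpus/corpus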

CSV

The CSV files (e.g. train.csv) have the following format:

path;label;length
relative/path/to/example;lower case transcription without punctuation;3.14159265359
[...]

Where path is the WAV file's path relative to the DATA_DIR/corpus/ directory (String). By default, label is the lower case transcription without punctuation (String). Finally, length is the audio length in seconds (Float).
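
Since the CSV files are plain semicolon-separated text, they are easy to inspect from the shell. A small sketch that sums the length column of train.csv (assuming the format above):

# Print the total audio duration of train.csv in hours, skipping the header line.
awk -F';' 'NR > 1 { total += $3 } END { printf "%.2f hours\n", total / 3600 }' train.csv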

Free Speech Corpora

Corpus Statistics

ipython python/dataset/word_counts.py 
Calculating statistics for /home/gpuinstall/workspace/ctc-asr/data/train.csv
Word based statistics:
        total_words = 10,069,671
        number_unique_words = 81,161
        mean_sentence_length = 14.52 words
        min_sentence_length = 1 words
        max_sentence_length = 84 words
        Most common words:  [('the', 551055), ('to', 306197), ('and', 272729), ('of', 243032), ('a', 223722), ('i', 192151), ('in', 149797), ('that', 146820), ('you', 144244), ('it', 118133)]
        27,416 words occurred only 1 time; 37,422 words occurred only 2 times; 49,939 words occurred only 5 times; 58,248 words occurred only 10 times.

Character based statistics:
        total_characters = 52,004,043
        mean_label_length = 75.00 characters
        min_label_length = 2 characters
        max_label_length = 422 characters
        Most common characters: [(' ', 9376326), ('e', 5264177), ('t', 4205041), ('o', 3451023), ('a', 3358945), ('i', 2944773), ('n', 2858788), ('s', 2624239), ('h', 2598897), ('r', 2316473), ('d', 1791668), ('l', 1686896), ('u', 1234080), ('m', 1176076), ('w', 1052166), ('c', 999590), ('y', 974918), ('g', 888446), ('f', 851710), ('p', 710252), ('b', 646150), ('v', 421126), ('k', 387714), ('x', 62547), ('j', 61048), ('q', 34558), ('z', 26416)]
        Most common characters: [' ', 'e', 't', 'o', 'a', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'u', 'm', 'w', 'c', 'y', 'g', 'f', 'p', 'b', 'v', 'k', 'x', 'j', 'q', 'z']

Usage

Training

Start training by invoking asr/train.py. Use asr/train.py -- --delete to start a clean run and remove the old checkpoints. Please note that all commands are expected to be executed from the project's root folder. The additional -- before the actual flags indicates the end of the IPython flags.
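
A typical invocation from the project root could look like this (a sketch; it assumes the scripts are run through IPython, as the flag handling above suggests):

# Start a clean training run, removing old checkpoints first.
ipython asr/train.py -- --delete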

The training progress can be monitored using TensorBoard. To start TensorBoard, use tensorboard --logdir <checkpoint_directory>. By default it can then be accessed via localhost:6006.
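
For example, with the default directory layout from above, started from the project root:

tensorboard --logdir ../ctc-asr-checkpoints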

Evaluation

Evaluate the current model by invoking asr/evaluate.py. Use asr/evaluate.py -- --dev to run on the development set instead of the test set.

Prediction

To run prediction on a given 16 kHz, mono WAV file, use asr/predict.py --input <wav_path>.
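
Recordings in other formats or sample rates can be converted first with sox; a small sketch, where input.mp3 stands in for an arbitrary source file:

# Convert the recording to 16 kHz mono WAV, then transcribe it.
sox input.mp3 -r 16000 -c 1 input.wav
python3 asr/predict.py --input input.wav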

ctc-asr's People

Contributors

mdangschat, yweweler

ctc-asr's Issues

Update Documentation

  • Directories:
    • Point out the required speech_checkpoints and speech-corpus dirs.
    • Remember to update the tree output.
  • CSV: Add information about the required CSV format to README.md. (#8)
  • Reference the speech-corpus-dl git.
  • Reset params.py and validate default params. (#10)

Error with output node name when freezing the graph

Hi, I am trying to freeze the graph, but when I use "bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=/path_to_file/graph.pbtxt", I get the output below. How do I know which one to use?

No inputs spotted.
Found 36 variables: (name=global_step, type=int64(9), shape=[]) (name=conv/conv2d/kernel, type=float(1), shape=[11,41,1,32]) (name=conv/conv2d/bias, type=float(1), shape=[32]) (name=conv/conv2d_1/kernel, type=float(1), shape=[11,21,32,32]) (name=conv/conv2d_1/bias, type=float(1), shape=[32]) (name=conv/conv2d_2/kernel, type=float(1), shape=[11,21,32,96]) (name=conv/conv2d_2/bias, type=float(1), shape=[96]) (name=rnn/cudnn_lstm/opaque_kernel, type=float(1), shape=) (name=dense4/dense/kernel, type=float(1), shape=[4096,2048]) (name=dense4/dense/bias, type=float(1), shape=[2048]) (name=logits/dense/kernel, type=float(1), shape=[2048,29]) (name=logits/dense/bias, type=float(1), shape=[29]) (name=beta1_power, type=float(1), shape=[]) (name=beta2_power, type=float(1), shape=[]) (name=conv/conv2d/kernel/Adam, type=float(1), shape=[11,41,1,32]) (name=conv/conv2d/kernel/Adam_1, type=float(1), shape=[11,41,1,32]) (name=conv/conv2d/bias/Adam, type=float(1), shape=[32]) (name=conv/conv2d/bias/Adam_1, type=float(1), shape=[32]) (name=conv/conv2d_1/kernel/Adam, type=float(1), shape=[11,21,32,32]) (name=conv/conv2d_1/kernel/Adam_1, type=float(1), shape=[11,21,32,32]) (name=conv/conv2d_1/bias/Adam, type=float(1), shape=[32]) (name=conv/conv2d_1/bias/Adam_1, type=float(1), shape=[32]) (name=conv/conv2d_2/kernel/Adam, type=float(1), shape=[11,21,32,96]) (name=conv/conv2d_2/kernel/Adam_1, type=float(1), shape=[11,21,32,96]) (name=conv/conv2d_2/bias/Adam, type=float(1), shape=[96]) (name=conv/conv2d_2/bias/Adam_1, type=float(1), shape=[96]) (name=rnn/cudnn_lstm/opaque_kernel/Adam, type=float(1), shape=) (name=rnn/cudnn_lstm/opaque_kernel/Adam_1, type=float(1), shape=) (name=dense4/dense/kernel/Adam, type=float(1), shape=[4096,2048]) (name=dense4/dense/kernel/Adam_1, type=float(1), shape=[4096,2048]) (name=dense4/dense/bias/Adam, type=float(1), shape=[2048]) (name=dense4/dense/bias/Adam_1, type=float(1), shape=[2048]) (name=logits/dense/kernel/Adam, type=float(1), shape=[2048,29]) (name=logits/dense/kernel/Adam_1, type=float(1), shape=[2048,29]) (name=logits/dense/bias/Adam, type=float(1), shape=[29]) (name=logits/dense/bias/Adam_1, type=float(1), shape=[29])
Found 59 possible outputs: (name=global_step/read, op=Identity) (name=global_step/cond/switch_t, op=Identity) (name=global_step/cond/switch_f, op=Identity) (name=global_step/add, op=Add) (name=seed2, op=Select) (name=IteratorToStringHandle, op=IteratorToStringHandle) (name=rnn/cudnn_lstm/Identity, op=Identity) (name=rnn/cudnn_lstm/zeros/Less, op=Less) (name=rnn/cudnn_lstm/zeros_1/Less, op=Less) (name=dense4/dense/kernel/Regularizer/l2_regularizer, op=Mul) (name=dense_to_sparse/Shape, op=Shape) (name=gradients/zeros_like, op=ZerosLike) (name=gradients/dense4/dropout/dropout/mul_grad/tuple/control_dependency_1, op=Identity) (name=gradients/dense4/dropout/dropout/truediv_grad/tuple/control_dependency_1, op=Identity) (name=gradients/dense4/Minimum_grad/tuple/control_dependency_1, op=Identity) (name=gradients/zeros_like_3, op=ZerosLike) (name=gradients/rnn/cudnn_lstm/CudnnRNN_grad/tuple/control_dependency_1, op=Identity) (name=gradients/rnn/cudnn_lstm/CudnnRNN_grad/tuple/control_dependency_2, op=Identity) (name=gradients/conv/Minimum_2_grad/tuple/control_dependency_1, op=Identity) (name=gradients/conv/Minimum_1_grad/tuple/control_dependency_1, op=Identity) (name=gradients/conv/Minimum_grad/tuple/control_dependency_1, op=Identity) (name=gradients/conv/conv2d/Conv2D_grad/tuple/control_dependency, op=Identity) (name=conv/conv2d/kernel/Adam/read, op=Identity) (name=conv/conv2d/kernel/Adam_1/read, op=Identity) (name=conv/conv2d/bias/Adam/read, op=Identity) (name=conv/conv2d/bias/Adam_1/read, op=Identity) (name=conv/conv2d_1/kernel/Adam/read, op=Identity) (name=conv/conv2d_1/kernel/Adam_1/read, op=Identity) (name=conv/conv2d_1/bias/Adam/read, op=Identity) (name=conv/conv2d_1/bias/Adam_1/read, op=Identity) (name=conv/conv2d_2/kernel/Adam/read, op=Identity) (name=conv/conv2d_2/kernel/Adam_1/read, op=Identity) (name=conv/conv2d_2/bias/Adam/read, op=Identity) (name=conv/conv2d_2/bias/Adam_1/read, op=Identity) (name=cond/switch_t, op=Identity) (name=cond/switch_f, op=Identity) (name=zeros, op=Fill) (name=rnn/cudnn_lstm/opaque_kernel/Adam/cond/switch_t, op=Identity) (name=rnn/cudnn_lstm/opaque_kernel/Adam/cond/switch_f, op=Identity) (name=rnn/cudnn_lstm/opaque_kernel/Adam/read, op=Identity) (name=cond_1/switch_t, op=Identity) (name=cond_1/switch_f, op=Identity) (name=zeros_1, op=Fill) (name=rnn/cudnn_lstm/opaque_kernel/Adam_1/cond/switch_t, op=Identity) (name=rnn/cudnn_lstm/opaque_kernel/Adam_1/cond/switch_f, op=Identity) (name=rnn/cudnn_lstm/opaque_kernel/Adam_1/read, op=Identity) (name=dense4/dense/kernel/Adam/read, op=Identity) (name=dense4/dense/kernel/Adam_1/read, op=Identity) (name=dense4/dense/bias/Adam/read, op=Identity) (name=dense4/dense/bias/Adam_1/read, op=Identity) (name=logits/dense/kernel/Adam/read, op=Identity) (name=logits/dense/kernel/Adam_1/read, op=Identity) (name=logits/dense/bias/Adam/read, op=Identity) (name=logits/dense/bias/Adam_1/read, op=Identity) (name=Adam, op=AssignAdd) (name=concat, op=ConcatV2) (name=concat_1, op=ConcatV2) (name=Merge/MergeSummary, op=MergeSummary) (name=save/Identity, op=Identity)

Input & output of graph

I wanted to know what the input and output nodes of the graph generated in your code are. Could you please provide this information?
Thank you in advance.

How can I train with my dataset?

I have a dataset: one folder 'wav' containing the .wav files, and one text file with one line per WAV file in the format name_wav text_of_wav.
How can I train with this data? Thanks so much, I'm a beginner.

Inference graph

Hello,
I just wanted to know where you are saving the .pbtxt file. I noticed your code creates this graph file, but I am not able to locate the code snippet that creates it.
Thanks in advance.

Value of Beam Width

Hi,
Could I have some information on how the beam width was chosen to be 1024? What is the role of the beam width parameter? I am confused about it.

About models

Hello, could I have a trained model that you no longer need? The computing power of my computer is relatively poor. I want to test the results of the model and then consider training with cloud services.

Common Voice Dataset

Hi,

I just wanted to know if all the datasets you have used are clean speech. Specifically, I am wondering about the Common Voice dataset; have you analyzed it by any chance? Since they have a platform for recording, a mobile app as well as a browser platform, I feel there is a chance that the recordings can be noisy.

Thank you

Configuration for low memory GPU

I use a laptop with 2 GB of GPU memory (Nvidia MX150).

I am trying to build a new language model, so I have tried many source codes (DeepSpeech, PyTorch, etc.).

To make my laptop able to handle the process, I configured the other source codes with a low batch size and number of n_hidden units. For your code, I already tried reducing the batch size to 1 and num_units_rnn to 1024, but it still runs out of GPU memory.

Do you have any recommendations for the settings?

The command that I use:
python3 asr/train.py -- --used_model ds2 --rnn_cell rnn_relu --feature_type mfcc --batch_size 1 --max_epochs 15 --cudnn True --allow_vram_growth True --num_units_rnn 1024 --delete tensorboard learning_rate 0.00001

Issue with input and output names

Hello, I am currently trying to freeze the graph from this model, and I am unable to do so because when I inspect the "graph.pbtxt" created after training, there is no node with the name "logits/dense".

Please help me figure out what the output node name is so I can freeze the graph to .pb.

Thank you.
Regards,
Rahul B
