Live Testing for Arabic Speech Command Recognition

A live testing app for an Arabic keyword spotting model.

Quickstart

$ virtualenv .env
$ source .env/bin/activate
$ pip install -r requirements.txt
$ python main.py

Dataset

Original Dataset: https://github.com/abdulkaderghandoura/arabic-speech-commands-dataset

The dataset is a list of pairs (x, y), where x is the input speech signal and y is the corresponding keyword. The final dataset consists of 12000 such pairs covering 40 keywords. Each audio file is one second long and sampled at 16 kHz. There were 30 participants, each of whom recorded 10 utterances of every keyword, so each keyword has 300 audio files (30 × 10 = 300) and the dataset has 30 × 10 × 40 = 12000 files in total. The total size of the dataset is ~384 MB. The table below lists the 40 chosen keywords with their translations into Arabic and pronunciations in the International Phonetic Alphabet (IPA):

As commonly done in machine learning settings, we split the dataset into three subsets: training, validation, and testing.

Considering the number of instances in our dataset, we decided to keep 80% of them as the training set, 10% as the validation set, and the remaining 10% as a hold-out testing set.

In our split method, we guarantee that all recordings of a given contributor fall within the same subset. This keeps signals from the same voice out of both the training and the validation/testing sets, which would otherwise affect the validity of the results. It also means the validation and testing scores reflect how well the model generalizes to new people outside of our dataset.
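The speaker-wise split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `split_by_speaker`, the speaker IDs, and the record layout are all hypothetical, and only the 80/10/10 ratio and the 30 × 40 × 10 structure come from the text.

```python
import random

def split_by_speaker(speaker_ids, train=0.8, val=0.1, seed=0):
    """Assign whole speakers to train/val/test so no voice spans two subsets."""
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)
    n_train = round(len(speakers) * train)
    n_val = round(len(speakers) * val)
    return {
        "train": set(speakers[:n_train]),
        "val": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }

# Hypothetical metadata: 30 speakers x 40 keywords x 10 utterances each.
records = [(f"spk{s:02d}", kw, utt)
           for s in range(30) for kw in range(40) for utt in range(10)]

groups = split_by_speaker([r[0] for r in records])
subsets = {name: [r for r in records if r[0] in spk]
           for name, spk in groups.items()}
sizes = {name: len(rows) for name, rows in subsets.items()}
```

Because the split is made over speakers rather than files, the 80/10/10 ratio is applied to the 30 participants, and every file follows its speaker.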

Data Augmentation

We used audiomentations for the augmentation tasks.

To make up for the low volume of data, we combined several data augmentation techniques and applied them over 10 rounds, each augmentation with a probability of 0.5:

Add Gaussian noise to the samples
Time stretch the signal without changing the pitch
Shift the samples forwards or backwards
Frequency masking
Time masking

We also added 3000 silent segments with some Gaussian noise to the dataset to be able to detect silence.
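The "several rounds, each augmentation applied with probability 0.5" idea can be sketched in plain NumPy. This is a stand-in for illustration only and does not use the audiomentations API. Every function name and every magnitude (noise level, shift range, mask width) is an assumption. Time stretching and frequency masking are left out because they need resampling or a spectrogram.

```python
import numpy as np

SAMPLE_RATE = 16_000  # from the dataset description: 1 s clips at 16 kHz

def add_gaussian_noise(x, rng, max_std=0.01):
    # Noise level drawn per call; the range is an assumed value.
    return x + rng.normal(0.0, rng.uniform(0.001, max_std), size=x.shape)

def shift(x, rng, max_fraction=0.2):
    # Shift forwards or backwards; samples wrap around (np.roll).
    limit = int(len(x) * max_fraction)
    return np.roll(x, rng.integers(-limit, limit + 1))

def time_mask(x, rng, max_fraction=0.1):
    # Zero out one random contiguous stretch of the signal.
    width = rng.integers(0, int(len(x) * max_fraction) + 1)
    start = rng.integers(0, len(x) - width + 1)
    out = x.copy()
    out[start:start + width] = 0.0
    return out

AUGMENTATIONS = [add_gaussian_noise, shift, time_mask]

def augment(x, rng, p=0.5):
    # Each augmentation fires independently with probability p.
    for fn in AUGMENTATIONS:
        if rng.random() < p:
            x = fn(x, rng)
    return x

def augment_rounds(x, rounds=10, seed=0):
    # One augmented copy of x per round.
    rng = np.random.default_rng(seed)
    return [augment(x, rng) for _ in range(rounds)]

def make_silence(n, seed=0, std=0.005):
    # "Silent" clips: low-level Gaussian noise only (std is an assumed value).
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, SAMPLE_RATE), dtype=np.float32) * std
```

Under these assumptions, `augment_rounds(clip)` returns 10 variants of one clip, and `make_silence(3000)` would produce the extra silence class.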

Data Preprocessing

Mel-frequency cepstral coefficients (MFCCs) are among the most widely used features for representing speech signals in ASR systems. Although they are not the only option, they are known to help achieve strong results compared to other features, which prompted us to use them in our experiments.
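For readers unfamiliar with the feature, here is a minimal NumPy sketch of the usual MFCC pipeline: framing, windowing, power spectrum, mel filterbank, log, and DCT. The text does not state the project's extraction settings, so the frame length, hop, number of mel bands, and number of coefficients below are all assumed values, and the function names are hypothetical. In practice a library routine would normally be used instead.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters equally spaced on the mel scale from 0 Hz to sr/2.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):    # rising slope
            fb[i, k] = (k - left) / (center - left)
        for k in range(center, right):   # falling slope
            fb[i, k] = (right - k) / (right - center)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    # Mel-band energies, then log compression.
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Unnormalized DCT-II over the mel axis; keep the first n_mfcc coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_mel @ basis.T  # shape: (n_frames, n_mfcc)
```

With these assumed settings, a one-second 16 kHz clip becomes a small 2-D array of frames by coefficients, which is the kind of input a 2-D convolutional network can consume.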

Model

The model is a convolutional neural network with 3 stacked convolutional layers of 64, 32, and 32 channels (feature maps), respectively. Each convolutional layer is followed by batch normalization and a 2 × 2 max-pooling layer. These layers are followed by a dropout layer with a drop probability of 0.3 and a fully connected feed-forward layer with 64 hidden units.
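To see how the three conv/pool stages shrink the input, the following plain-Python sketch traces feature-map shapes and counts trainable parameters. Only the channel counts (64, 32, 32), the 2 × 2 pooling, and the 64 hidden units come from the description. The 3 × 3 kernels, "same" padding, the (1, 97, 13) input shape, and the 41-way output layer (40 keywords plus silence) are assumptions made for illustration, and the function names are hypothetical.

```python
def conv_block(shape, out_channels, kernel=3):
    """Conv ('same' padding) -> batch norm -> 2x2 max-pool.

    Returns the output shape and the block's trainable parameter count.
    """
    in_channels, h, w = shape
    conv_params = out_channels * (in_channels * kernel * kernel + 1)  # weights + bias
    bn_params = 2 * out_channels  # learnable scale and shift
    return (out_channels, h // 2, w // 2), conv_params + bn_params

def trace(input_shape=(1, 97, 13), channels=(64, 32, 32), hidden=64, n_classes=41):
    shape, total, report = input_shape, 0, []
    for c in channels:
        shape, params = conv_block(shape, c)
        total += params
        report.append((shape, params))
    flat = shape[0] * shape[1] * shape[2]    # flattened size; dropout adds no parameters
    total += flat * hidden + hidden          # fully connected layer with 64 units
    total += hidden * n_classes + n_classes  # assumed output layer
    return report, flat, total

report, flat, total = trace()
for shape, params in report:
    print(shape, params)
print("flattened:", flat, "total parameters:", total)
```

Changing `input_shape` to the real feature dimensions shows directly how the feature-map size and parameter count respond.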
