filippogiruzzi / voice_activity_detection

Voice Activity Detection based on Deep Learning & TensorFlow

License: GNU General Public License v3.0

Python 94.08% Dockerfile 2.99% Makefile 1.98% Shell 0.95%

voice-activity-detection deep-learning speech tensorflow time-series time-series-classification resnet speech-recognition speech-detection python mfcc-features machine-learning vad deeplearning artificial-intelligence deep-neural-networks librispeech librispeech-dataset

voice_activity_detection's Introduction

Voice Activity Detection project

Keywords: Python, TensorFlow, Deep Learning, Time Series classification

Installation
1.1 Basic installation
1.2 Virtual environment installation
1.3 Docker installation
Introduction
2.1 Goal
2.2 Results
Project structure
Dataset
Project usage
5.1 Dataset automatic labeling
5.2 Record raw data to .tfrecord format
5.3 Train a CNN to classify Speech & Noise signals
5.4 Export trained model & run inference on Test set
Todo
Resources

1. Installation

This project was designed for:

Ubuntu 20.04
Python 3.7.3
TensorFlow 1.15.4

$ cd /path/to/project/
$ git clone https://github.com/filippogiruzzi/voice_activity_detection.git
$ cd voice_activity_detection/

1.1 Basic installation

⚠️ It is recommended to use virtual environments!

$ pyenv install 3.7.3
$ pyenv virtualenv 3.7.3 vad-venv
$ pyenv activate vad-venv

$ pip install -r requirements.txt
$ pip install -e .

1.2 Virtual environment installation

1.3 Docker installation

You can pull the latest image from DockerHub and run Python commands inside the container:

$ docker pull filippogrz/tf-vad:latest
$ docker run --rm --gpus all -v /var/run/docker.sock:/var/run/docker.sock -it --entrypoint /bin/bash -e TF_FORCE_GPU_ALLOW_GROWTH=true filippogrz/tf-vad

If you want to build the docker image and run the container from scratch, run the following commands.

Build the docker image:

$ make build

(This might take a while.)

Run the docker image:

$ make local-nobuild

2. Introduction

2.1 Goal

The purpose of this project is to design and implement a real-time Voice Activity Detection algorithm based on Deep Learning.

The solution is based on MFCC feature extraction and a 1D-Resnet model that classifies whether an audio signal is speech or noise.

2.2 Results

Model	Train acc.	Val acc.	Test acc.
1D-Resnet	99 %	98 %	97 %

Raw and post-processed inference results on a test audio signal are shown below.

3. Project structure

The project voice_activity_detection/ has the following structure:

vad/data_processing/: raw data labeling, processing, recording & visualization
vad/training/: data, input pipeline, model & training / evaluation / prediction
vad/inference/: exporting trained model & inference

4. Dataset

Please download the LibriSpeech ASR corpus dataset from https://openslr.org/12/, and extract all files to: /path/to/LibriSpeech/.

The dataset contains approximately 1000 hours of 16 kHz read English speech from audiobooks, and is well suited for Voice Activity Detection.

I automatically annotated the test-clean set of the dataset with a pretrained VAD model.

Please feel free to use the labels/ folder and the pre-trained VAD model (only for inference) from this link.

5. Project usage

$ cd /path/to/project/voice_activity_detection/vad/

5.1 Dataset automatic labeling

Skip this subsection if you already have the labels/ folder, which contains annotations from a different pre-trained model.

$ python data_processing/librispeech_label_data.py --data-dir /path/to/LibriSpeech/test-clean/ --exported-model /path/to/pretrained/model/

This will record the annotations into /path/to/LibriSpeech/labels/ as .json files.

5.2 Record raw data to .tfrecord format

$ python data_processing/data_to_tfrecords.py --data-dir /path/to/LibriSpeech/

This will record the split data in .tfrecord format in /path/to/LibriSpeech/tfrecords/

5.3 Train a CNN to classify Speech & Noise signals

$ python training/train.py --data-dir /path/to/LibriSpeech/tfrecords/

5.4 Export trained model & run inference on Test set

$ python inference/export_model.py --model-dir /path/to/trained/model/dir/
$ python inference/inference.py --data-dir /path/to/LibriSpeech/ --exported-model /path/to/exported/model/ --smoothing

The trained model will be recorded in /path/to/LibriSpeech/tfrecords/models/resnet1d/. The exported model will be recorded inside this directory.
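The README does not describe what the --smoothing flag does internally. As a rough illustration of the general idea of post-processing frame-level predictions, here is a minimal sketch of a sliding median (majority vote) filter over binary labels. The function name, the window size and the tie-breaking rule are assumptions for illustration, not the repository's implementation.

```python
from typing import List


def median_smooth(labels: List[int], window: int = 5) -> List[int]:
    """Smooth binary frame labels (1 = speech, 0 = noise) with a sliding median.

    Hypothetical post-processing step; not taken from the repository.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        # Clip the window at the signal boundaries.
        neighbourhood = labels[max(0, i - half): i + half + 1]
        # For binary labels the median is a majority vote; ties go to noise.
        smoothed.append(1 if 2 * sum(neighbourhood) > len(neighbourhood) else 0)
    return smoothed


raw = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
print(median_smooth(raw, window=3))
```

With a window of 3, the isolated speech frame at index 2 is removed and the one-frame gap at index 7 is filled, which is the kind of cleanup a raw versus post-processed comparison would show.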

6. Todo

Compare Deep Learning model to a simple baseline
Train on full dataset
Improve data balancing
Add time series data augmentation
Study ROC curve & classification threshold
Add online inference
Evaluate quantitatively post-processing methods on the Test set
Add model description & training graphs
Add Google Colab demo

7. Resources

Voice Activity Detection for Voice User Interface, Medium
Deep learning for time series classification: a review, Fawaz et al., 2018, Arxiv
Time Series Classification from Scratch with Deep Neural Networks: A Strong Baseline, Wang et al., 2016, Arxiv


voice_activity_detection's Issues

Does anyone know how to train the model with my own dataset?

I have been trying to train the model with a new dataset that contains .WAV files, but the pipeline expects the .FLAC format that was used for the original data. How can I do this?
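One way to approach this is to check first that the .WAV files match what the README says about LibriSpeech (16 kHz audio) before adapting the data loading. The sketch below reads a 16-bit mono PCM file with Python's standard wave module. The function names, the 16-bit mono requirement and the test tone are assumptions for illustration; the repository's own loader is not shown in this page.

```python
import math
import os
import struct
import tempfile
import wave
from typing import List, Tuple

EXPECTED_RATE = 16000  # LibriSpeech sample rate stated in the README


def read_wav_mono16(path: str) -> Tuple[int, List[float]]:
    """Read a 16-bit mono PCM WAV file; return (sample_rate, samples in [-1, 1))."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # WAV PCM samples are little-endian signed 16-bit integers.
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return sample_rate, [s / 32768.0 for s in ints]


def write_tone(path: str, rate: int = EXPECTED_RATE, n_samples: int = 1600) -> None:
    """Write a short 440 Hz tone as 16-bit mono PCM (demo helper only)."""
    ints = [int(0.5 * 32767 * math.sin(2 * math.pi * 440 * i / rate))
            for i in range(n_samples)]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % n_samples, *ints))


demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_tone(demo_path)
rate, samples = read_wav_mono16(demo_path)
print(rate, len(samples), rate == EXPECTED_RATE)
```

If the sample rate differs from 16 kHz, the audio would presumably need resampling before feature extraction, since the model was trained on 16 kHz speech.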

where to find pretrained model?

Hi, is there a pretrained VAD model you would recommend?

I always get errors on the path when I run in windows

an example

how to predict on batch input?

i can run inference.py successfully, but it's a bit slow when predicting one block by one on long audio.

How can i predict on batch input?

thanks.
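A common pattern for this is to group the audio blocks into fixed-size batches and call the model once per batch instead of once per block. The sketch below shows only the batching logic; predict_fn is a hypothetical stand-in for whatever call inference.py makes to the exported model, and the threshold in dummy_predict is made up for the demo.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(blocks: Sequence,
                       predict_fn: Callable[[Sequence], List[int]],
                       batch_size: int = 32) -> List[int]:
    """Run predict_fn once per batch and concatenate the per-block results."""
    predictions: List[int] = []
    for batch in batched(blocks, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


def dummy_predict(batch: Sequence) -> List[int]:
    """Stand-in model: label a block as speech if its mean magnitude is high."""
    return [1 if sum(abs(x) for x in block) / len(block) > 0.1 else 0
            for block in batch]


blocks = [[0.0, 0.01], [0.5, -0.4], [0.2, 0.3], [0.0, 0.0], [0.9, 0.9]]
print(predict_in_batches(blocks, dummy_predict, batch_size=2))
```

The last batch may be smaller than batch_size, so the model call has to accept a variable batch dimension.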

Performance issue in /vad/training (by P3)

Hello! I've found a performance issue in input_pipeline.py: dataset.batch(batch_size) (line 70) should be called before dataset.map(parse_func, num_parallel_calls=8) (line 69), which could make your program more efficient.

Here is the tensorflow document to support it.

Besides, you need to check the function parse_func called in dataset.map(parse_func, num_parallel_calls=8) whether to be affected or not to make the changed code work properly. For example, if parse_func needs data with shape (x, y, z) as its input before fix, it would require data with shape (batch_size, x, y, z) after fix.

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.
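To make the shape requirement concrete, here is a framework-free sketch of the same contract: a per-record parser versus a batched parser that must accept a list of records and return one extra leading dimension. The record layout (three little-endian floats) and all names are assumptions for illustration. In TensorFlow the analogous change is typically from tf.io.parse_single_example to tf.io.parse_example, though whether parse_func in this repository can be switched that way has not been checked here.

```python
import struct
from typing import List

N_FEATURES = 3  # hypothetical number of floats per serialized record


def parse_one(record: bytes) -> List[float]:
    """Per-element parse: one record -> [N_FEATURES] floats."""
    return list(struct.unpack("<%df" % N_FEATURES, record))


def parse_batch(records: List[bytes]) -> List[List[float]]:
    """Batched parse: [batch_size] records -> [batch_size, N_FEATURES] floats."""
    flat = struct.unpack("<%df" % (N_FEATURES * len(records)), b"".join(records))
    return [list(flat[i:i + N_FEATURES]) for i in range(0, len(flat), N_FEATURES)]


records = [struct.pack("<3f", i, i + 0.5, i + 1.0) for i in range(4)]

# map -> batch: parse each record, then group the results in pairs.
map_then_batch = [[parse_one(r) for r in records[i:i + 2]] for i in range(0, 4, 2)]

# batch -> map: group the raw records in pairs, then parse each group in one call.
batch_then_map = [parse_batch(records[i:i + 2]) for i in range(0, 4, 2)]

print(map_then_batch == batch_then_map)
```

Both orders produce the same nested result; the difference is that the batched parser is called once per batch rather than once per record.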

Has anyone run this program on Windows?

I'm sincerely looking for some experience. Thank you!

Label

How can I produce the labels for my own dataset?

librosa error

Hello, thank you for your work. When running python3 data_processing/data_to_tfrecords.py --data_dir /path/to/LibriSpeech/ on your provided test-clean

got this error
librosa.util.exceptions.ParameterError: Since S.shape[0] is 128, frame_length is expected to be 254 or 255; found 512

Any suggestion? I am using tf=1.14 (I can't find 1.12.0), if this matters. Thank you.
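The numbers in the error message are consistent with the usual one-sided spectrum convention, where a frame of length n yields n // 2 + 1 frequency rows. The short sketch below works through that arithmetic; the function names are made up for illustration, and it does not identify which call in the project raises the error. One plausible cause, stated as an assumption, is an installed librosa version that differs from the one the project was written against.

```python
from typing import Tuple


def rows_for_frame_length(frame_length: int) -> int:
    """Number of rows in a one-sided spectrum for a given frame length."""
    return frame_length // 2 + 1


def expected_frame_lengths(n_rows: int) -> Tuple[int, int]:
    """The two frame lengths n that satisfy n // 2 + 1 == n_rows."""
    even = 2 * (n_rows - 1)
    return even, even + 1


print(expected_frame_lengths(128))   # frame lengths matching 128 rows
print(rows_for_frame_length(512))    # rows a frame length of 512 would need
```

So an input with 128 rows matches frame lengths 254 or 255, while frame_length=512 would require 257 rows, which is exactly the mismatch the message reports.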

Labels for LibriSpeech

Hi,
you wrote: "I automatically annotated the test-clean set of the dataset with a pretrained VAD model", my question is -
what pretrained-model did you use to get the labels of the LibriSpeech dataset?

Thank you!

Is anyone else having this problem?

I have run it on Ubuntu and Windows, but this error appears.

Can someone help with that?

Can not read tfrecords

I was trying to run train.py using the labels provided in the repo, but it seems it could not read the features. The logs look like this:

2021-04-06 15:14:51.759823: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: subsegment/features. Can't parse serialized Example.
2021-04-06 15:14:51.759847: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: subsegment/features. Can't parse serialized Example.
2021-04-06 15:14:51.759867: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: subsegment/features. Can't parse serialized Example.
2021-04-06 15:14:51.759831: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: subsegment/features. Can't parse serialized Example.
2021-04-06 15:14:51.759895: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: subsegment/features. Can't parse serialized Example.

Plz help. Thx

about inference

can i use this model to detect voice in audio with music and noise?

hello, every one.

can i use this model to detect voice in audio with music and noise?

Link to pre-trained models

Hi!
Would it be possible to make the pre-trained models available?
Thanks!

Having problem when "Restoring parameters" at the start of training

Hello,

I got a segmentation fault when trying to read "model.ckpt-0". It seems the checkpoint was not produced. The log is:

INFO:tensorflow:Restoring parameters from /voice_activity_detection-master/data/LibriSpeech/tfrecordsmodels/resnet1d/2021-04-06T10:21:18.846963/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
Segmentation fault (core dumped)

Pls help me. Thx
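The checkpoint path in the log contains "tfrecordsmodels", whereas the README describes the model directory as tfrecords/models/resnet1d/. That pattern would appear if the data directory was passed without a trailing slash and then joined by string concatenation. The timestamp folder also contains colons, which are not allowed in Windows file names. Whether either point explains the segmentation fault is not known; the sketch below only shows a way to build such a path that avoids both, with hypothetical names and an example directory.

```python
import os
from datetime import datetime


def model_dir(data_dir: str, started: datetime) -> str:
    """Build <data_dir>/models/resnet1d/<timestamp> in a portable way."""
    # Format the timestamp without colons so it is a valid Windows file name.
    stamp = started.strftime("%Y-%m-%dT%H-%M-%S")
    # os.path.join adds the separator whether or not data_dir ends with one.
    return os.path.join(data_dir, "models", "resnet1d", stamp)


started = datetime(2021, 4, 6, 10, 21, 18)
with_slash = model_dir("/data/LibriSpeech/tfrecords/", started)
without_slash = model_dir("/data/LibriSpeech/tfrecords", started)
print(with_slash)
print(os.path.normpath(with_slash) == os.path.normpath(without_slash))
```

As a quick workaround without code changes, passing --data-dir with the trailing slash, as the README commands do, should avoid the missing separator.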