
speech-enhancement's Introduction

Speech-enhancement



Vincent Belz : [email protected]

Published in Towards Data Science: Speech-enhancement with Deep learning

Introduction

This project aims to build a speech enhancement system that attenuates environmental noise.

Spectrogram denoising

Audio can be represented in many different ways, from raw time series to time-frequency decompositions. The choice of representation is crucial for the performance of your system. Among time-frequency decompositions, spectrograms have proven to be a useful representation for audio processing. They consist of 2D images representing sequences of Short Time Fourier Transforms (STFT), with time and frequency as axes and brightness representing the strength of a frequency component at each time frame. As such, they are a natural domain in which to apply CNN architectures designed for images directly to sound. Between magnitude and phase spectrograms, magnitude spectrograms contain most of the structure of the signal, while phase spectrograms appear to show only little temporal and spectral regularity.

In this project, I use magnitude spectrograms as the representation of sound (cf. image below) in order to predict the noise model to be subtracted from a noisy voice spectrogram.

sound representation
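
For illustration, a magnitude/phase pair of this kind can be computed with librosa. This is only a minimal sketch: the file name is a placeholder, and the STFT parameters shown are indicative; the project's actual values live in args.py.

    import librosa
    import numpy as np

    # Load an audio clip at 8 kHz (the sampling rate used in this project).
    audio, sr = librosa.load("example.wav", sr=8000)

    # Short Time Fourier Transform: a complex matrix with frequency and time axes.
    stft = librosa.stft(audio, n_fft=255, hop_length=63)

    # Magnitude (used by the network) and phase (kept for later reconstruction).
    magnitude, phase = np.abs(stft), np.angle(stft)

    # Magnitudes are usually displayed and processed on a dB scale.
    magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)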

The project is decomposed into three modes: data creation, training and prediction.

Prepare the data

To create the datasets for training, I gathered clean English speech and environmental noises from different sources.

The clean voices were mainly gathered from LibriSpeech, an ASR corpus based on public domain audio books. I also used some data from SiSec. The environmental noises were gathered from the ESC-50 dataset and from https://www.ee.columbia.edu/~dpwe/sounds/.

For this project, I focused on 10 classes of environmental noise: ticking clock, footsteps, bells, handsaw, alarm, fireworks, insects, brushing teeth, vacuum cleaner and snoring. These classes are illustrated in the image below (I created this image using pictures from https://unsplash.com).

classes of environmental noise used

To create the datasets for training/validation/testing, audio was sampled at 8 kHz and I extracted windows slightly above 1 second. I performed some data augmentation for the environmental noises (taking the windows at different times creates different noise windows). Noises were blended with clean voices, with a randomization of the noise level (between 20% and 80%). In the end, the training data consisted of 10 h of noisy voice & clean voice, and the validation data of 1 h of sound.
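
As a rough sketch of this blending step (the function name and signature are hypothetical, not the project's actual code; the real pipeline does this during data creation), assuming voice and noise are 1-D numpy arrays of the same length, i.e. one ~1 second window at 8 kHz:

    import numpy as np

    def blend_noise(voice, noise, min_level=0.2, max_level=0.8):
        """Blend a noise window into a clean voice window at a random level
        between 20% and 80% (hypothetical helper, illustrating the idea only)."""
        level = np.random.uniform(min_level, max_level)
        noisy_voice = voice + level * noise
        return noisy_voice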

To prepare the data, I recommend creating data/Train and data/Test folders in a location separate from your code folder, then creating the structure shown in the image below:

data folder structure

Modify the noise_dir, voice_dir, path_save_spectrogram, path_save_time_serie, and path_save_sound paths accordingly in args.py, which holds the default parameters for the program.

Place your noise audio files in the noise_dir directory and your clean voice files in voice_dir.

Specify how many frames you want to create as nb_samples in args.py (or pass it as an argument from the terminal). I left nb_samples=50 by default for the demo, but for production I would recommend 40,000 or more.

Then run python main.py --mode='data_creation'. This will randomly blend some clean voices from voice_dir with some noises from noise_dir, and save to disk the spectrograms of the noisy voices, noises and clean voices, as well as the complex phases, time series and sounds (for QC or to test other networks). It takes the input parameters defined in args.py. The STFT parameters, frame length and hop_length can be modified in args.py (or passed as arguments from the terminal), but with the default parameters each window will be converted into a spectrogram matrix of size 128 x 128.
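
As a back-of-the-envelope check of where the 128 x 128 size and the "slightly above 1 second" windows come from, assuming default values of roughly frame_length=8064 samples, n_fft=255 and hop_length=63 (these are assumptions; verify against your args.py):

    sample_rate = 8000
    frame_length = 8064              # samples per extracted window (assumed default)
    n_fft = 255                      # FFT size (assumed default)
    hop_length = 63                  # hop between STFT frames (assumed default)

    window_duration = frame_length / sample_rate    # 1.008 s -> "slightly above 1 second"
    freq_bins = 1 + n_fft // 2                      # 128 frequency bins
    time_frames = frame_length // hop_length        # 128 time frames
    print(window_duration, freq_bins, time_frames)  # 1.008 128 128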

The datasets used for training are the magnitude spectrograms of the noisy voices and the magnitude spectrograms of the clean voices.

Training

The model used for training is a U-Net, a deep convolutional autoencoder with symmetric skip connections. U-Net was initially developed for biomedical image segmentation. Here the U-Net has been adapted to denoise spectrograms.

The input to the network is the magnitude spectrogram of the noisy voice. The output is the noise to model (the noisy voice magnitude spectrogram minus the clean voice magnitude spectrogram). Both input and output matrices are mapped into a distribution between -1 and 1 with a global scaling.
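
The idea of the global scaling is a single, dataset-wide affine mapping (rather than a per-sample normalization), so it can be inverted exactly at prediction time. A minimal sketch with placeholder constants (the values actually used for the input and output matrices are defined in the project code and differ between the two):

    def global_scale(spec_db, offset=46.0, scale=50.0):
        """Map dB magnitude spectrograms into roughly [-1, 1] using fixed,
        dataset-wide constants (placeholder values, for illustration only)."""
        return (spec_db + offset) / scale

    def global_unscale(scaled, offset=46.0, scale=50.0):
        """Inverse mapping, applied to network outputs before reconstruction."""
        return scaled * scale - offset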

Unet training

Many configurations have been tested during training. In the preferred configuration, the encoder is made of 10 convolutional layers (with LeakyReLU, max pooling and dropout). The decoder is a symmetric expanding path with skip connections. The last activation layer is a hyperbolic tangent (tanh), giving an output distribution between -1 and 1. For training from scratch, the initial random weights were set with the He normal initializer.
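
For readers unfamiliar with this kind of architecture, below is a toy U-Net-style block in tensorflow.keras with made-up filter counts; it only illustrates the ingredients listed above (convolution + LeakyReLU, max pooling, dropout, skip connection, tanh output), not the project's actual 10-layer network:

    import tensorflow as tf
    from tensorflow.keras import layers

    def tiny_unet(input_shape=(128, 128, 1)):
        """Toy encoder/decoder with one skip connection (illustration only)."""
        inputs = tf.keras.Input(shape=input_shape)

        # Encoder block: convolution + LeakyReLU + max pooling + dropout.
        c1 = layers.Conv2D(32, 3, padding="same", kernel_initializer="he_normal")(inputs)
        c1 = layers.LeakyReLU()(c1)
        p1 = layers.Dropout(0.2)(layers.MaxPooling2D()(c1))

        # Bottleneck.
        b = layers.Conv2D(64, 3, padding="same", kernel_initializer="he_normal")(p1)
        b = layers.LeakyReLU()(b)

        # Decoder block with a skip connection back to the encoder.
        u1 = layers.Concatenate()([layers.UpSampling2D()(b), c1])
        c2 = layers.Conv2D(32, 3, padding="same", kernel_initializer="he_normal")(u1)
        c2 = layers.LeakyReLU()(c2)

        # tanh output so predictions lie in [-1, 1], like the scaled targets.
        outputs = layers.Conv2D(1, 1, activation="tanh")(c2)
        return tf.keras.Model(inputs, outputs)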

The model is compiled with the Adam optimizer, and the loss function used is the Huber loss, a compromise between the L1 and L2 losses.
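
In tensorflow.keras terms this corresponds to something like the sketch below; the learning rate is an assumption, not the project's exact setting, and unet stands for the U-Net model described above:

    import tensorflow as tf

    def compile_unet(unet: tf.keras.Model) -> tf.keras.Model:
        unet.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # assumed learning rate
            loss=tf.keras.losses.Huber(),  # quadratic near zero (L2-like), linear for large errors (L1-like)
        )
        return unet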

Training on a modern GPU takes a couple of hours.

If you have a GPU for deep learning computation on your local machine, you can train with: python main.py --mode="training". It takes as input the parameters defined in args.py. By default it will train from scratch (you can change this by setting training_from_scratch to false). You can start training from pre-trained weights specified by weights_folder and name_model. I provide model_unet.h5, with the weights from my training, in ./weights. The number of epochs and the batch size for training are specified by epochs and batch_size. The best weights are automatically saved during training as model_best.h5. You can call fit_generator to load only part of the data from disk at training time.
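
A rough sketch of what the training step amounts to, assuming the scaled spectrogram matrices are already loaded in memory (the argument names here are illustrative; the actual script reads everything from args.py):

    import tensorflow as tf

    def train_unet(unet, X_train, y_train, X_val, y_val, epochs=10, batch_size=20):
        # Keep the best weights seen so far as model_best.h5.
        checkpoint = tf.keras.callbacks.ModelCheckpoint(
            "model_best.h5", monitor="val_loss", save_best_only=True, verbose=1)
        return unet.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=epochs, batch_size=batch_size,
                        shuffle=True, callbacks=[checkpoint])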

Personally, I used the free GPU available on Google Colab for my training. I provide an example notebook at ./colab/Train_denoise.ipynb. If you have a large amount of space available on your drive, you can load all your training data to your drive and load part of it at training time with the fit_generator option of tensorflow.keras. Personally, I had limited space available on my Google Drive, so I pre-prepared batches of 5 GB in advance to be loaded to the drive for training. Weights were regularly saved and reloaded for the next training run.

In the end, I obtained a training loss of 0.002129 and a validation loss of 0.002406. Below is a loss graph from one of the training runs.

loss training

Prediction

For prediction, the noisy voice audio is converted into numpy time series of windows slightly above 1 second. Each time series is converted into a magnitude spectrogram and a phase spectrogram via the STFT. The noisy voice spectrograms are passed through the U-Net, which predicts the noise model for each window (cf. graph below). Prediction time for one window, once converted to a magnitude spectrogram, is around 80 ms on a standard CPU.

flow prediction part 1

Then the noise model is subtracted from the noisy voice spectrogram (here I apply a direct subtraction, as it was sufficient for my task; one could imagine training a second network to adapt the noise model, or applying a matched filter as is done in signal processing). The "denoised" magnitude spectrogram is combined with the initial phase as input to the inverse Short Time Fourier Transform (ISTFT). The denoised time series can then be converted back to audio (cf. graph below).

flow prediction part 2
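
Putting the two prediction figures together, the reconstruction of one window looks roughly like the sketch below. The function and parameter names are hypothetical, and the global dB scaling/unscaling around the network call is omitted for brevity (the project's actual prediction code handles it):

    import librosa
    import numpy as np

    def denoise_window(model, audio_window, n_fft=255, hop_length=63):
        """Denoise one ~1 s window with a trained U-Net (illustrative sketch)."""
        stft = librosa.stft(audio_window, n_fft=n_fft, hop_length=hop_length)
        magnitude, phase = np.abs(stft), np.angle(stft)

        # Predict the noise model for this window (scaling steps omitted here).
        noise_pred = model.predict(magnitude[np.newaxis, ..., np.newaxis])[0, ..., 0]

        # Direct subtraction of the predicted noise, kept non-negative.
        denoised_magnitude = np.maximum(magnitude - noise_pred, 0.0)

        # Recombine with the original phase and invert the STFT.
        denoised_stft = denoised_magnitude * np.exp(1j * phase)
        return librosa.istft(denoised_stft, hop_length=hop_length)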

Let's have a look at the performance on validation data!

Below I display some results from validation examples for Alarm/Insects/Vacuum cleaner/Bells noise. For each of them, I display the initial noisy voice spectrogram, the denoised spectrogram predicted by the network, and the true clean voice spectrogram. We can see that the network generalizes the noise modelling well and produces a slightly smoothed version of the voice spectrogram, very close to the true clean voice spectrogram.

More examples of spectrogram denoising on validation data are displayed in the gif at the top of the repository.

validation examples

Let's hear the results converted back to sounds:

Audios for Alarm example:

Input example alarm

Predicted output example alarm

True output example alarm

Audios for Insects example:

Input example insects

Predicted output example insects

True output example insects

Audios for Vacuum cleaner example:

Input example vacuum cleaner

Predicted output example vacuum cleaner

True output example vacuum cleaner

Audios for Bells example:

Input example bells

Predicted output example bells

True output example bells

Below I show the corresponding displays converted back to time series:

validation examples timeserie

You can have a look at these displays/audios in the Jupyter notebook demo_predictions.ipynb that I provide in the ./demo_data folder.

Below, I show the spectrogram denoising gif from the top of the repository in the time series domain.

Timeserie denoising

As an extreme test, I applied the model to some voices blended with many noises at a high level. The network appeared to work surprisingly well for the denoising. The total time to denoise a 5-second audio clip was around 4 seconds (on a standard CPU).

Below are some examples:

Example 1:

Input example test 1

Predicted output example test 1

Example 2:

Input example test 2

Predicted output example test 2

How to use?

- Clone this repository
- pip install -r requirements.txt
- python main.py OPTIONS

* Modes of the program (Possible OPTIONS):

--mode: default='prediction', type=str, choices=['data_creation', 'training', 'prediction']

Have a look at possible arguments for each option in args.py.
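
For example, the three modes are invoked as follows (the --nb_samples flag is assumed from the description above and can be left to its default from args.py):

    # create the training data (spectrograms, phases, time series, sounds)
    python main.py --mode='data_creation' --nb_samples=50

    # train the U-Net (from scratch or from pre-trained weights, see args.py)
    python main.py --mode='training'

    # denoise the audio files configured in args.py
    python main.py --mode='prediction'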

References

Jansson, Andreas, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar and Tillman Weyde. Singing Voice Separation with Deep U-Net Convolutional Networks. ISMIR (2017).

[https://ejhumphrey.com/assets/pdf/jansson2017singing.pdf]

Grais, Emad M. and Plumbley, Mark D., Single Channel Audio Source Separation using Convolutional Denoising Autoencoders (2017).

[https://arxiv.org/abs/1703.08019]

Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Springer, Cham

[https://arxiv.org/abs/1505.04597]

K. J. Piczak. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.

[DOI: http://dx.doi.org/10.1145/2733373.2806390]

License

License

speech-enhancement's People

Contributors: vbelz

speech-enhancement's Issues

Error (tensorflow)

I am getting the following error from the tensorflow module:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Please help me get rid of this error.

Thank you.

Python 2 or 3 ?

Hi, should we use Python 2.x or Python 3.x in order to make it run?
I ask because I am going through some issues installing the requirements, and maybe it has to do with the Python version.

Thanks for your work!

global scaling

Hey @vbelz,
First of all, thank you for sharing this project, it helps me a lot!
There is one thing I didn't understand, and that's the global scaling of matrix_spec (and the inverse global scaling).
How did you choose the numbers for the scaling? And why is there a different scaling for X_in and X_ou?

Question on Error of Invalid Instruction (core dumped)

Hey @vbelz, I had a question:
While running python main.py --mode='data_creation' I get the error Invalid Instruction (core dumped).
I guess it is because of the tensorflow version (1.15.2), as my CPU does not support AVX.
It would not give an error if I used tensorflow version 1.5.

If I want to use the same version of tensorflow, that is 1.15.2, what could be an alternative?

Question about the inputs and the outputs of the model

Hey,
First of all, your code is great! It worked for me and it is very simple and clear 👍
One question - in your model you used Xin as spectrogram(noisy_voice) and Xout as spectrogram(noisy_voice) - spectrogram(voice). I didn't understand why you did the subtraction, so I tried taking Xout to be spectrogram(voice), but then I got an underfitted loss. Do you know why that happens?

Thanks again!
Olga :)

a little confusion

Hey!
I have some confusion about the computing process. The input audio, whose size is 112501 KB, gives an output of 112486 KB. Could you tell me the reason, and explain the operations performed on the audio throughout the prediction?

Thank you

Inference pipeline

Hello,

The model does a great job of removing the noise. However, I notice that the speech quality is degraded.

For testing, I changed X_denoise = m_amp_db_audio - inv_sca_X_pred[:,:,:,0] to X_denoise = m_amp_db_audio.

I was expecting the original audio file. FYI, my input is a mono-channel 16000 Hz wav file. Can you please help me? I am guessing I need to change some parameters other than the sample rate.

MemoryError

Hello, I encountered an error: MemoryError: Unable to allocate array with shape (20000, 128, 128). I changed 40000 to 20000, but the issue is still there. I would like to ask if this is due to excessive training set data or because nb_samples is too large.

The lack of documentation

Hello,
I really need your help.
When I run main.py, I get the error [Errno 2] No such file or directory: './Train/sound/noisy_voice_long.wav'.
I want to know what noisy_voice_long.wav / noise_long.wav / voice.wav are and how to get them.
Please answer me.

Parser takes only the first character from the filename and says "File not found"

I know it's very difficult to understand my issue, but I'll try my best to explain.
So I've cloned this repository from GitHub and am working on it.
When I run the program without any arguments, it works fine:
python main.py --audio_input_prediction works fine.
But when I try to pass my own file, it shows an error:
python main.py --audio_input_prediction myaudio.wav shows an error saying "FileNotFoundError: [Errno 2] No such file or directory: '<File Path/m'"
Notice how it only takes the first character from my argument?
In the code, for the default mode, args.py has something like parser.add_argument('--audio_input_prediction', default=['default_audio.wav'], type=list), and that works fine.
So naturally, I tried to add '[]' to my file name:
python main.py --audio_input_prediction [myaudio.wav] shows an error too, which says "FileNotFoundError: [Errno 2] No such file or directory: '<File Path/['"
See, here it took only the first character of my provided argument, i.e. '['.
And the file IS there. No spelling mistakes in the file name either. Any help would be very much appreciated.

My conclusion is that the issue is in args.py, specifically in this line: parser.add_argument('--audio_input_prediction', default=['noisy_voice_long_t2.wav'], type=list).
I even tried to change the type to 'str', but I still got the same error.
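
For context, this is how argparse behaves with type=list: the type callable is applied to the raw string, and list('myaudio.wav') splits it into single characters (the default value is unaffected because defaults bypass type). A common fix, shown below as a suggestion rather than a change that exists in the repository, is nargs with type=str:

    import argparse

    parser = argparse.ArgumentParser()
    # nargs='+' collects one or more whitespace-separated file names into a
    # proper list of strings, instead of list() splitting a name into characters.
    parser.add_argument('--audio_input_prediction',
                        default=['noisy_voice_long_t2.wav'],
                        nargs='+', type=str)

    args = parser.parse_args(['--audio_input_prediction', 'myaudio.wav'])
    print(args.audio_input_prediction)  # ['myaudio.wav']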

How to train for a different audio sampling rate?

Hi vbelz,

I am wondering what changes need to be made in order to train for a different audio sampling rate, e.g. at 44100 Hz?

I assume both the model and some parameters in args.py need to be modified. Can you please share some insights on this?

Thanks,
Tony

Is this project unsupervised learning?

Hi @vbelz, I have a problem.
Is this project unsupervised learning? There are 10 kinds of noise collected in this project. I originally thought that 10 models would be found in the weights folder, but I only saw two models, model_best and model_unet.

Thanks.

Update requirements.txt; there is no json file output

Hello there!
I had quite a journey installing all the dependencies and Python, so it would be great if you could update the requirements.txt file! I am guessing that my problem is caused by the wrong version of tensorflow.
But my problem is that after training there isn't any .json file output in any folder.
It would be great if you could help me 😃

[BUG]: Validation against test data

Training error

At line 60, as mentioned here, you're validating against test data while training? Isn't it supposed to be train data?

    history = generator_nn.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, shuffle=True, callbacks=[checkpoint], verbose=1, validation_data=(X_test, y_test))

General questions

Hi @vbelz,

First of all, thank you for your work. I have tried to denoise some audio and it worked very well, but I have a few questions.

Quoted from README:

Specify how many frames you want to create as nb_samples in args.py (or pass it as an argument from the terminal). I left nb_samples=50 by default for the demo, but for production I would recommend 40,000 or more.

1. What exactly is nb_samples?

2. Are the weights you provide from nb_samples=50?

3. Should I resample the audio to 8 kHz for denoising, or is that done inside the network? Also, should I do it for training?

4. I want to tweak it to be a better denoiser for background noise rather than specific sounds. What are your thoughts on this? I have a dataset with clean samples and background noise samples. Will it work if I train on it? Which hyperparameters should I use?

Thank you so much and sorry for bothering you!

Why are the extracted windows slightly above 1 second?

First of all, thank you so much for this repository. I am doing some research in the speech domain, and this has been very helpful.

But I have some doubts regarding it.

  1. Why are the extracted windows slightly above 1 second and not exactly 1 second?
  2. Can this 1 second be increased to several seconds? How would this affect the training?

Thanks in advance.
