
CNNs:Speech-Music-Discrimination

```
@article{papakostas2018speech,
  title={Speech-Music Discrimination Using Deep Visual Feature Extractors},
  author={Papakostas, Michalis and Giannakopoulos, Theodoros},
  journal={Expert Systems with Applications},
  year={2018},
  publisher={Elsevier}
}
```

Synopsis

This project describes a new approach to the very traditional problem of Speech-Music Discrimination. To the best of our knowledge, the proposed method provides state-of-the-art results on the task. We employ a Deep Convolutional Neural Network (CNN) and offer a compact framework to perform segmentation and binary (Speech/Music) classification, exploiting the benefits of transferring knowledge from architectures pretrained on ImageNet. Our method does not rely on traditional audio features, which yield inferior results on this task. Instead, it exploits the highly invariant features produced by CNNs and operates on pseudocolored RGB or grayscale frequency images that represent audio segments.

Evaluation of different methods on 11 hours of continuous radio streams

*The dataset included speech-only, music-only and overlapping speech-music audio samples; see the paper for further details

ROC curves of the two proposed methods, i.e. with (red) and without (blue) transfer learning, on the same dataset

Evaluation of our best method (pink) against the methods proposed by Pikrakis & Theodoridis on datasetA and datasetB

The repository consists of the following modules:

  • Audio segmentation using the pyAudioAnalysis library
  • CNN training using the CAFFE Deep-Learning Framework.
  • Audio classification using:
  • CNNs
  • CNNs + median_filtering
  • CNNs + median_filtering + HMMs
  • Two pretrained CNNs on the task of Speech/Music Discrimination. The networks can also be used for weight initialization on other similar tasks. (to be added)
  • An audio dataset consisting of more than 10h of continuous audio streams. At this point the data are available in the form of spectrograms. (to be added)
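One of the classification options above is median filtering of the frame-level CNN decisions. As a hedged illustration of that post-processing idea (a minimal sketch, not the repository's implementation), smoothing a sequence of per-frame labels might look like:

```python
import numpy as np

def median_smooth(labels, k=5):
    """Smooth a sequence of per-frame class labels (e.g. 0=speech,
    1=music) with a sliding median filter of odd length k."""
    labels = np.asarray(labels)
    pad = k // 2
    padded = np.pad(labels, pad, mode="edge")  # repeat edge values
    return np.array([int(np.median(padded[i:i + k]))
                     for i in range(len(labels))])
```

Isolated single-frame flips are removed while longer runs of speech or music survive, which is the intuition behind the median-filtering step.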

Installation

  • Dependencies
  1. pyAudioAnalysis
  2. CAFFE Deep-Learning Framework

* Detailed installation instructions are provided at the links above

  • Add Caffe to your working dir
  1. trainCNN.py --> Line:4
  2. train_net.sh --> Line:2
  3. ClassifyWav.py --> Line:14

or add pycaffe to your .bashrc for directory-independent access

  • Open the .bashrc file located in your home directory. In a terminal type:
    1. cd ~ to navigate to your home directory

    2. ls -a to see the file listed

    3. nano .bashrc to open the file in the terminal

    4. scroll to the bottom of the file and add:

      export PYTHONPATH=$PYTHONPATH:"/home/--myPathToCaffe--/caffe/python"

      , where --myPathToCaffe-- is the path to the Caffe library on your local machine

      e.g.: export PYTHONPATH=$PYTHONPATH:"/home/michalis/Libraries/caffe/python"

    5. source ~/.bashrc to update your shell configuration
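The manual steps above can be collapsed into two commands. This is a sketch: keep the --myPathToCaffe-- placeholder until you substitute your real Caffe path.

```shell
# Append the pycaffe directory to PYTHONPATH in ~/.bashrc,
# then reload the file so the change takes effect immediately.
echo 'export PYTHONPATH=$PYTHONPATH:"/home/--myPathToCaffe--/caffe/python"' >> ~/.bashrc
source ~/.bashrc
```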

Code Description

Data Preparation

  1. Convert your audio files into pseudocolored RGB or grayscale spectrogram images using generateSpectrograms.py. TO BE UPDATED: a) how to run, b) how to set the segmentation parameters, c) what the output looks like

  2. Split the spectrogram images into train and test as shown in Fig1:

Fig1. - Data Structure
  • Train/Test and Classes represent directories
  • Samples represent files
  • If you wish to use the architecture proposed in this work:
  1. Data should be pseudocolored RGB spectrogram images of size 227x227, as shown in Fig2
Fig2. - Sample RGB Spectrogram
  2. or grayscale spectrogram images of size 200x200, as shown in Fig3

Fig3. - Sample Grayscale Spectrogram
  • Image resizing can be done directly using the CAFFE framework.
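In the repository this conversion is handled by generateSpectrograms.py. Purely as an illustrative sketch of the idea (the 20 ms window/step and the nearest-neighbour resize are assumptions, not the script's exact parameters), a 200x200 grayscale spectrogram image can be produced with plain NumPy:

```python
import numpy as np

def wav_to_spectrogram(signal, fs, win=0.020, step=0.020, out_size=200):
    """Compute a log-magnitude spectrogram of a 1-D signal sampled at
    fs Hz and nearest-neighbour resize it to out_size x out_size."""
    n = int(win * fs)        # window length in samples
    hop = int(step * fs)     # hop length in samples
    frames = [signal[i:i + n] * np.hamming(n)
              for i in range(0, len(signal) - n, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1))   # (time, freq) magnitudes
    spec = np.log1p(spec).T[::-1]                # freq on y-axis, low at bottom
    spec -= spec.min()
    spec /= spec.max() + 1e-12                   # normalize to [0, 1]
    ys = np.linspace(0, spec.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, spec.shape[1] - 1, out_size).astype(int)
    return (spec[np.ix_(ys, xs)] * 255).astype(np.uint8)

def save_pgm(img, path):
    """Write a grayscale uint8 image as a binary PGM file."""
    with open(path, "wb") as f:
        f.write(b"P5\n%d %d\n255\n" % (img.shape[1], img.shape[0]))
        f.write(img.tobytes())
```

The resulting uint8 matrix can then be written to disk (e.g. as a PGM file, as above) and used as a grayscale network input.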

Training

  • Train a CNN
  1. Provide Network Architecture file. You can use one of the proposed architectures (SpeechMusic_RGB.prototxt, SpeechMusic_GRAY.prototxt ) or another CNN of your choice.

  2. Train

    Training can be done either by training a new network from scratch or by finetuning a pretrained architecture.

    The pretrained model used in the paper for fine-tuning is the caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 initially proposed in Donahue, Jeffrey, et al. "Long-term recurrent convolutional networks for visual recognition and description." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. To exploit the weight initialization of the pretrained model use the CNN architecture shown in SpeechMusic_RGB.prototxt.

    If you wish to deploy the smaller CNN architecture that operates on grayscale images you should use the CNN architecture shown in SpeechMusic_GRAY.prototxt. This model was trained from scratch without weight initialization.

    • Train from scratch:
  python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations>
    • Finetune a pretrained network:
  python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.caffemodel --init_type fin
    • Resume training:
  python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.solverstate --init_type res
    • For more details about modifying other learning parameters (e.g. learning rate, step size, etc.) type:
```shell
 python trainCNN.py -h
```
  1. Outputs:
    1. _<snapshot_prefix>solver.prototxt : the solver file required by Caffe to train the CNN. The solver file describes all the parameters of the current experiment. Commented lines contain additional information about the experiment that is not required by the Caffe framework.
    2. _<snapshot_prefix>TrainSource.txt & _<snapshot_prefix>TestSource.txt : full paths to the training and test samples, together with each sample's class
  • Train HMM
python ClassifyWav.py trainHMM <path_to_test_data> <hmm_model_name> <core_classification_method> <trained_network> <classification_method> 
  *This requires an already trained CNN

  **Change [trainCNN.py](https://github.com/MikeMpapa/CNNs-Speech-Music-Discrimination/blob/master/trainCNN.py), Line:9, to  ``` caffe.set_mode_gpu() ```  to support GPU implementation** 
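The HMM itself is trained by ClassifyWav.py. Purely as a hedged illustration of why HMM smoothing helps (a generic two-state Viterbi decoder, not the repository's code, with an assumed self-transition probability of 0.9), consider:

```python
import numpy as np

def viterbi_binary(probs, trans=0.9):
    """Decode the most likely 0/1 state sequence from per-frame
    posteriors probs[t] = P(class 1 | frame t), using a 2-state HMM
    with self-transition probability `trans` (assumed, not learned)."""
    probs = np.asarray(probs, dtype=float)
    T = len(probs)
    emis = np.column_stack([1.0 - probs, probs])      # (T, 2) emission probs
    A = np.log(np.array([[trans, 1 - trans],
                         [1 - trans, trans]]))        # log transition matrix
    logd = np.log(emis[0] + 1e-12)                    # Viterbi scores at t=0
    back = np.zeros((T, 2), dtype=int)                # backpointers
    for t in range(1, T):
        scores = logd[:, None] + A                    # (prev_state, cur_state)
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(emis[t] + 1e-12)
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):                     # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

A brief dip in the posteriors (e.g. a single misclassified frame inside a music segment) is overridden by the transition model, because switching state twice costs more than tolerating one weak emission.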

Classification

  • Evaluate trained CNN Model with/without post processing:
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>.caffemodel  <classification_method> <classification_type_flag> "" 
  • Evaluate trained HMM Model with post processing:
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>-5000.caffemodel <core_classification_method> <classification_type_flag> <hmm_model_name> 

Change ClassifyWav.py, Line:17, to caffe.set_mode_gpu() to support GPU implementation

Code Example

  • Generate Spectrogram Images:

  • Train from scratch:

    python trainCNN.py SpeechMusic_RGB.prototxt Train Test myOutput 4000 
  • Finetune pretrained network (train and test paths are according to Fig1):

    python trainCNN.py SpeechMusic_RGB.prototxt Train Test myOutput 1000 --init caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000.caffemodel --init_type fin
  • Resume training from pretrained network (train and test paths are according to Fig1):

    python trainCNN.py SpeechMusic_RGB.prototxt Train Test my_new_Output 2000 --init myOutput.solverstate --init_type res 
  • Evaluate trained CNN on .wav file/s without post-processing:

    python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel  cnn 0 "" 
  • Evaluate trained CNN on .wav file/s with post-processing:

    python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel  cnn 1 "" 
  • Train an HMM after applying median filtering:

    python ClassifyWav.py trainHMM Data/testWavs hmm1 cnn CNN-SM-5000.caffemodel 1 
  • Test using pretrained HMM:

    python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel cnn 2 hmm1 

Pretrained model

A pretrained model on the task using pseudo-colored RGB images along with the solverstate can be found here

Conclusions

We provide a new method for the task of Speech/Music Discrimination using Convolutional Neural Networks. The main contributions of this work are the following:

  1. A compact framework for: * segmenting and classifying long audio streams into Speech and Music segments, and * training new CNN models on binary audio tasks

  2. A large dataset of long audio streams (more than 10h) for the task of speech-music discrimination. The dataset is provided in the form of spectrograms.

  3. Two different pretrained CNN architectures that can be used for weight initialization for other binary classification tasks.

  4. To our knowledge, our method provides state-of-the-art results on the task.

References & Citations

If you found our project useful, please cite the following publications:

CNNs:Speech-Music-Discrimination

```
@article{papakostas2018speech,
  title={Speech-Music Discrimination Using Deep Visual Feature Extractors},
  author={Papakostas, Michalis and Giannakopoulos, Theodoros},
  journal={Expert Systems with Applications},
  year={2018},
  publisher={Elsevier}
}
```

PyAudioAnalysis

```
@article{giannakopoulos2015pyaudioanalysis,
  title={pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis},
  author={Giannakopoulos, Theodoros},
  journal={PloS one},
  volume={10},
  number={12},
  year={2015},
  publisher={Public Library of Science}
}
```

Caffe Framework

```
@article{jia2014caffe,
  title={Caffe: Convolutional Architecture for Fast Feature Embedding},
  author={Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor},
  journal={arXiv preprint arXiv:1408.5093},
  year={2014}
}
```

If you used the pretrained network caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 for your experiments, please also cite:

```
@inproceedings{donahue2015long,
  title={Long-term recurrent convolutional networks for visual recognition and description},
  author={Donahue, Jeffrey and Anne Hendricks, Lisa and Guadarrama, Sergio and Rohrbach, Marcus and Venugopalan, Subhashini and Saenko, Kate and Darrell, Trevor},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={2625--2634},
  year={2015}
}
```
