LSTM/BLSTM based PIT for Two Speakers

====================================================================================
                Two-speaker speech separation with BLSTM and PIT
                   Author: Peng Chao, EECS, Peking University
            Github: https://github.com/pchao6/LSTM_PIT_Speech_Separation
                            Created in: June 2018
====================================================================================

Progress in multi-talker mixed speech separation and recognition, often referred to as the "cocktail-party problem", has been comparatively modest. Although human listeners can easily perceive separate sources in an acoustic mixture, the same task seems to be extremely difficult for computers, especially when only a single microphone records the mixed speech.

1. Separation Performance

Note: The training and validation sets contain two-speaker mixtures generated by randomly selecting speakers and utterances from the WSJ0 set and mixing them at signal-to-noise ratios (SNRs) chosen uniformly between -2.5 dB and 2.5 dB.
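As a rough illustration of mixing at a chosen SNR, here is a minimal Python sketch. It is not the repository's mixing code (which lives in the create-speaker-mixtures scripts and may measure speech level differently). The names `rms` and `mix_at_snr`, the unit-RMS normalization, the symmetric split of the SNR between the two sources, and the truncation to the shorter signal are all assumptions made for this example.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(s1, s2, snr_db):
    """Mix two signals so that scaled s1 is snr_db louder than scaled s2.

    Assumptions (not taken from the repository): both sources are
    normalized to unit RMS, the SNR is split symmetrically (+snr_db/2 on
    s1, -snr_db/2 on s2), and the longer signal is truncated.
    """
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    r1, r2 = rms(s1), rms(s2)
    if r1 == 0 or r2 == 0:
        raise ValueError("cannot set an SNR for a silent source")
    g1 = 10 ** (snr_db / 40) / r1   # amplitude gain for half the SNR
    g2 = 10 ** (-snr_db / 40) / r2
    a = [g1 * v for v in s1]
    b = [g2 * v for v in s2]
    mix = [x + y for x, y in zip(a, b)]
    return mix, a, b


# Toy example: two sine tones, one second at 8000 Hz.
rng = random.Random(0)
snr_db = rng.uniform(-2.5, 2.5)
s1 = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
s2 = [math.sin(2 * math.pi * 310 * t / 8000) for t in range(8000)]
mix, a, b = mix_at_snr(s1, s2, snr_db)
measured_db = 20 * math.log10(rms(a) / rms(b))
print(round(snr_db, 4), round(measured_db, 4))
```

The two printed values should agree, which confirms that the level difference between the scaled sources equals the requested SNR.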

The separation performance of LSTM is as follows:

Gender Combination    SDR (dB)     SAR (dB)     SIR (dB)     STOI        ESTOI       PESQ
Overall               6.453328     9.372059     11.570311    0.536987    0.429255    1.653391
Male & Female         8.238905     9.939668     14.531649    0.521656    0.421868    1.663442
Female & Female       3.538810     8.134054     7.230494     0.560099    0.441704    1.553452
Male & Male           5.011563     9.026763     9.000010     0.550071    0.435083    1.675609

The separation performance of BLSTM is as follows:

Gender Combination    SDR (dB)     SAR (dB)     SIR (dB)     STOI        ESTOI       PESQ
Overall               9.177447     10.629142    16.116564    0.473229    0.377204    1.651099
Male & Female         10.647645    11.691969    18.203052    0.488542    0.393999    1.731112
Female & Female       7.309365     9.393608     13.355384    0.459762    0.363213    1.478075
Male & Male           7.797448     9.589827     14.198003    0.456667    0.358757    1.602058

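From the results above, mixed-gender mixtures are separated better than same-gender mixtures in terms of SDR, SAR, and SIR, and BLSTM outperforms LSTM on these three metrics. Note that the reported STOI and ESTOI values are lower for BLSTM than for LSTM, and PESQ shows no consistent difference between the two.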

2. Evaluation Criterion

  • SDR: Signal to Distortion Ratio
  • SAR: Signal to Artifact Ratio
  • SIR: Signal to Interference Ratio
  • STOI: Short Time Objective Intelligibility Measure
  • ESTOI: Extended Short Time Objective Intelligibility Measure
  • PESQ: Perceptual Evaluation of Speech Quality
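To give a feel for what a ratio such as SDR measures, here is a minimal Python sketch of a simplified distortion ratio, 10*log10(||s||^2 / ||s - s_hat||^2). This is not the full BSS Eval procedure: BSS Eval decomposes the error into interference and artifact terms, from which SIR and SAR are defined, and the sketch does none of that. The function name `simple_sdr` is made up for this example, and the README does not say which toolbox produced the numbers above.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def simple_sdr(reference, estimate):
    """Simplified distortion ratio in dB.

    Treats everything in (reference - estimate) as distortion. This is an
    illustrative simplification, not the BSS Eval SDR.
    """
    error = [r - e for r, e in zip(reference, estimate)]
    error_energy = energy(error)
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(energy(reference) / error_energy)


# Toy example: the estimate is the reference attenuated by 10%.
reference = [math.sin(0.1 * n) for n in range(1000)]
estimate = [0.9 * v for v in reference]
sdr_db = simple_sdr(reference, estimate)
print(round(sdr_db, 3))
```

In this toy case the error has one tenth of the reference amplitude, so the ratio is about 20 dB.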

3. Dependency Library

  • librosa
  • MATLAB (tested with R2016b 64-bit)
  • TensorFlow (tested with 1.4.0)
  • Anaconda3 (Python 3.5+)

4. Usage Process

Generate Mixed and Target Speech:

Once you have the WSJ0 data, you can use the code in "create-speaker-mixtures-V1/V2" to create the mixed speech. We mixed two-speaker audio at a sample rate of 8000 Hz.

Run the command line script:

bash run.sh

which contains three steps:

  1. Extract STFT features and convert them to TensorFlow's tfrecords format. After this step the training data is ready, and the file structure is as follows:
storage/
├── lists
│   ├── cv_tf.lst
│   ├── cv_wav.lst
│   ├── tr_tf.lst
│   ├── tr_wav.lst
│   ├── tt_tf.lst
│   └── tt_wav.lst
├── separated
├── TFCheckpoint
└── tfrecords
    ├── cv_tfrecord
    │   ├── 01aa010k_1.3053_01po0310_-1.3053.tfrecords
    │   ├── 01aa010p_0.93798_02bo0311_-0.93798.tfrecords
    │   ├── ...
    │   └── 409o0317_1.2437_025c0217_-1.2437.tfrecords
    ├── tr_tfrecord
    │   ├── 01aa010b_0.97482_209a010p_-0.97482.tfrecords
    │   ├── 01aa010b_1.4476_20aa010p_-1.4476.tfrecords
    │   ├── ...
    │   └── 409o0316_1.3942_20oo010p_-1.3942.tfrecords
    └── tt_tfrecord
        ├── 050a050a_0.032494_446o030v_-0.032494.tfrecords
        ├── 050a050a_1.7521_422c020j_-1.7521.tfrecords
        ├── ...
        └── 447o0312_2.0302_440c0206_-2.0302.tfrecords

Note: {tr,cv,tt}_wav.lst looks like this:

447o030v_0.1232_050c0109_-0.1232.wav
447o030v_1.7882_444o0310_-1.7882.wav
...
447o030x_0.98832_441o0308_-0.98832.wav
447o030x_1.4783_422o030p_-1.4783.wav
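The mixture file names in these lists appear to follow the pattern `<utterance1>_<level1>_<utterance2>_<level2>.<ext>`, with the two levels being equal and opposite in dB. That reading is inferred from the examples here, not from documentation, so treat the following Python sketch as an assumption. `MixtureInfo` and `parse_mixture_name` are hypothetical names, not part of the repository.

```python
import os
from typing import NamedTuple


class MixtureInfo(NamedTuple):
    utt1: str
    level1_db: float
    utt2: str
    level2_db: float


def parse_mixture_name(path):
    """Split a mixture file name into utterance IDs and levels.

    Assumes the name has an extension (.wav or .tfrecords) and exactly
    four underscore-separated fields, as in the examples in this README.
    """
    base = os.path.basename(path)
    stem, _ext = os.path.splitext(base)  # drops only the final extension
    utt1, level1, utt2, level2 = stem.split("_")
    return MixtureInfo(utt1, float(level1), utt2, float(level2))


info = parse_mixture_name("447o030v_0.1232_050c0109_-0.1232.wav")
print(info)
```

The same function also works on a full tfrecords path, because only the base name is parsed.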

{tr,cv,tt}_tf.lst looks like this:

storage/tfrecords/cv_tfrecord/011o031b_1.8_206a010u_-1.8.tfrecords
storage/tfrecords/cv_tfrecord/20ec0109_0.47371_020c020q_-0.47371.tfrecords
...
storage/tfrecords/cv_tfrecord/01zo030l_0.6242_40ho030s_-0.6242.tfrecords
storage/tfrecords/cv_tfrecord/20fo0109_1.1429_017o030p_-1.1429.tfrecords
  2. Train the deep neural network.
  3. Decode with the trained network to generate the separated audio.
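The training step relies on permutation invariant training (PIT). The general idea, as described in the PIT literature, is to compute the loss for every assignment of network outputs to target speakers and keep the smallest one. Below is a minimal pure-Python sketch for illustration only. It is not the repository's TensorFlow implementation, and `mse` and `pit_mse` are hypothetical names. The inputs stand for whatever per-speaker spectrogram estimates the network produces, represented here as nested lists of frames by frequency bins.

```python
import itertools


def mse(a, b):
    """Mean squared error between two equal-sized 2-D lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def pit_mse(estimates, targets):
    """Return (loss, permutation) with the lowest average MSE.

    permutation[i] is the index of the target assigned to estimate i.
    All assignments are tried, which is cheap for two speakers.
    """
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(len(targets))):
        loss = sum(mse(estimates[i], targets[p])
                   for i, p in enumerate(perm)) / len(perm)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the estimates come out in swapped speaker order.
targets = [
    [[1.0, 0.0], [1.0, 0.0]],  # speaker 0
    [[0.0, 1.0], [0.0, 1.0]],  # speaker 1
]
estimates = [
    [[0.1, 0.9], [0.0, 1.0]],  # close to speaker 1
    [[0.9, 0.1], [1.0, 0.0]],  # close to speaker 0
]
loss, perm = pit_mse(estimates, targets)
print(loss, perm)
```

Here the swapped assignment wins, so the output order does not have to match the label order during training.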

5. File Description

  • 1.create-speaker-mixtures-V1: Version one of scripts to generate the wsj0-mix multi-speaker dataset.
  • 2.create-speaker-mixtures-V2: Version two of scripts to generate the wsj0-mix multi-speaker dataset.
  • 3.SPHFile2Wav: Converts the SPH-format files of the TIMIT and WSJ0 corpora into wav format.
  • 4.introduction_to_mask: Introduction to the Computational Auditory Scene Analysis (mask-based) method in speech separation. The original page illustrates it with four figures: mixed speech, masks, recovered speech 1, and recovered speech 2.

  • 5.step_to_CASA_DL: Steps toward multi-speaker speech separation with Computational Auditory Scene Analysis and Deep Learning.
  • 6.separated_result_LSTM: Demos of separated speech based on LSTM and PIT.
  • 7.separated_result_BLSTM: Demos of separated speech based on BLSTM and PIT.
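The mask-based method mentioned under 4.introduction_to_mask can be illustrated with an ideal ratio mask, one common choice of time-frequency mask. The README does not say which mask the introduction material uses, so this is only a sketch under that assumption. The names `ideal_ratio_masks` and `apply_mask` are hypothetical, and the toy mixture assumes that source magnitudes simply add, which holds only approximately for real spectrograms.

```python
def ideal_ratio_masks(mag1, mag2, eps=1e-8):
    """Per-bin ratio masks computed from two source magnitude spectrograms.

    eps avoids division by zero in bins where both sources are silent.
    """
    mask1, mask2 = [], []
    for row1, row2 in zip(mag1, mag2):
        mask1.append([a / (a + b + eps) for a, b in zip(row1, row2)])
        mask2.append([b / (a + b + eps) for a, b in zip(row1, row2)])
    return mask1, mask2


def apply_mask(mask, mix_mag):
    """Element-wise product of a mask and the mixture magnitudes."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mix_mag)]


# Toy magnitudes: 2 frames x 2 frequency bins per source.
mag1 = [[3.0, 0.0], [1.0, 2.0]]
mag2 = [[1.0, 4.0], [1.0, 0.0]]
# Assumption: mixture magnitude is the sum of the source magnitudes.
mix_mag = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(mag1, mag2)]

mask1, mask2 = ideal_ratio_masks(mag1, mag2)
recovered1 = apply_mask(mask1, mix_mag)
recovered2 = apply_mask(mask2, mix_mag)
print(recovered1, recovered2)
```

Under the additive assumption the masked mixture reproduces each source's magnitudes almost exactly. In a trained system the network would estimate the masks from the mixture alone.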

6. Reference Paper & Code

Thanks to Dong Yu et al. for the paper and to Sining Sun (Northwestern Polytechnical University, China) et al. for sharing their code.

7. Directions of Future Research

  • Scaling down DNNs without compromising performance.
  • Multiple microphone algorithms.
  • Beyond single-modality algorithms, for example, using visual perception.
  • Beyond the mean squared error cost function.
  • Towards time-domain end-to-end systems.

I plan to keep working on speech separation for a long time. If you are interested, feel free to follow my recent work. Thanks for your attention!
