speech-separation's Introduction

PyTorch + Catalyst implementation of Looking to Listen at a Cocktail Party.

This repository handles the training process. For inference, check out the GUI wrapper: SpeechSeparationUI in PyQt.

This repository has been merged with asteroid as a recipe.

Table of Contents

  1. Requirements
  2. Setup
  3. Train
  4. Results
  5. References

Requirements

  1. Computation

    We ran this program on two GPUs: a 1050 Mobile and a Tesla V100. We did not run any benchmarks, but the V100 was roughly 400x faster. Training time also depends on how much data you download. Any server-grade GPU should therefore be sufficient.

  2. Storage

    This program generates a lot of files (downloaded and otherwise). Each audio file is about 96 KiB. With 7k unique audio clips and a 70/30 train/validation split, the data occupied ~120 GiB of storage. Plan for at least 1 TB if you download more audio clips.

  3. Memory

    A minimum of 4 GB of VRAM is required, which can handle a batch size of 2. With a batch size of 20 on two GPUs, training occupied 16 GiB of VRAM on each GPU.

Setup

If you are using Docker, just run inside the container:

./setup.sh && ./install.sh

Otherwise:

  1. Set up the directory structure

    ./setup.sh
  2. Install dependencies

    pip install -r requirements.txt

    Additional dependencies:

    i. ffmpeg
    ii. libav-tools
    iii. youtube-dl
    iv. sox

  3. Install

    ./install.sh

During inference

from src import generate_audio, load_model

Run

Run all these files as scripts.

cd src/loader

NOTE: Make sure the AVSpeech dataset is in the data/audio_visual/ folder. Downloading requires a Google account.

Download the video dataset - interruptible

python3 download.py

Extract sound from the video

Videos can be longer than 3 seconds, so multiple audio clips are extracted from a single video file.

python3 extract_audio.py
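
The splitting described above can be sketched as follows. This is a minimal illustration, not the code in extract_audio.py: the function name segment_windows and the choice to drop a trailing remainder shorter than the window are assumptions; only the 3-second length comes from the text.

```python
def segment_windows(duration_s, window_s=3.0):
    """Return (start, end) times of consecutive full windows inside a clip.

    Assumption: a trailing remainder shorter than `window_s` is dropped.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    count = int(duration_s // window_s)
    return [(i * window_s, (i + 1) * window_s) for i in range(count)]


if __name__ == "__main__":
    # A 10-second video yields three 3-second segments; the last second is dropped.
    for start, end in segment_windows(10.0):
        print(f"{start:.1f}s - {end:.1f}s")
```

Each (start, end) pair could then be passed to whatever tool cuts the audio, for example as a start offset and duration.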

Mix the audio - interruptible

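Synthetically mix clean audio. This can take a lot of disk space: each file is approximately 96 KiB, and the total number of files can be as large as C(total_files, input_audio_size), the number of ways to choose input_audio_size clips out of total_files, for each of train and val.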

python3 audio_mixer_generator.py
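
To get a feel for how quickly the number of mixtures grows, the count and the resulting disk usage can be estimated before mixing. The sketch below is an illustration only: the function names are hypothetical, the default of 2 clips per mixture is an assumption, and the 96 KiB figure is the approximate per-file size quoted above.

```python
import math


def mix_count(total_files, input_audio_size=2):
    """Number of distinct mixtures: choose `input_audio_size` clips out of `total_files`."""
    return math.comb(total_files, input_audio_size)


def storage_gib(n_files, kib_per_file=96):
    """Approximate disk usage in GiB for `n_files` files of `kib_per_file` KiB each."""
    return n_files * kib_per_file / (1024 ** 2)


if __name__ == "__main__":
    n = mix_count(100, 2)
    print(n, "mixtures, about", round(storage_gib(n), 2), "GiB")
```

Because the count grows roughly quadratically for pairs, mixing every possible combination of a large dataset is rarely practical; capping the number of mixtures is one way to stay within the available storage.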

Remove empty audio

Generating synthetically mixed audio at a high rate (100+ files per second) produces many empty audio files, which need to be removed.

python3 remove_empty_audio.py
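
One possible way to detect and delete empty files is sketched below using only the standard library. This is not the logic of remove_empty_audio.py: the definition of "empty" (no frames, all-zero sample bytes, or an unreadable file), the .wav extension, and the function names are assumptions. The all-zero check matches silence in signed PCM such as 16-bit WAV; 8-bit WAV stores silence as 128 and would need a different test.

```python
import wave
from pathlib import Path


def is_empty_wav(path):
    """True if the file has no frames or every sample byte is zero (assumed definition)."""
    try:
        with wave.open(str(path), "rb") as wav:
            n_frames = wav.getnframes()
            if n_frames == 0:
                return True
            return not any(wav.readframes(n_frames))
    except (wave.Error, EOFError):
        # Unreadable or truncated files are treated as empty too.
        return True


def remove_empty(folder):
    """Delete empty .wav files under `folder` and return the deleted paths."""
    removed = []
    for path in sorted(Path(folder).rglob("*.wav")):
        if is_empty_wav(path):
            path.unlink()
            removed.append(path)
    return removed
```

Returning the deleted paths makes it easy to drop the matching rows from any dataframe that still references them.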

Convert the path inside the generated dataframe

Relative paths differ between src and src/loader, and both directories contain files that need to access the data/ directory. Hence, create a copy of the dataframe with the correct paths in src/loader/.

python3 transform_df.py
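
The path conversion amounts to re-expressing a relative path against a different working directory. A minimal sketch, assuming the dataframe stores paths relative to the directory the script was run from (the function name rebase and the example path are hypothetical):

```python
import os.path


def rebase(path, old_dir, new_dir):
    """Rewrite `path`, given relative to `old_dir`, so that it is relative to `new_dir`."""
    return os.path.relpath(os.path.join(old_dir, path), start=new_dir)


if __name__ == "__main__":
    # A path that works from src/ needs one more "../" to work from src/loader/.
    print(rebase("../data/train/audio.wav", "src", os.path.join("src", "loader")))
```

Applying such a function to every path column of the dataframe and saving the result as a separate copy keeps the original file usable from its own directory.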

Run to cache all embeddings

Create video embeddings from all the video files. This step also records which videos are corrupted, including videos in which no face was detected.

python3 generate_video_embedding.py

Remove corrupt frames

Remove the corrupted video frames recorded in the previous step.

python3 remove_corrupt.py

Run to cache all spectrograms (optional)

Cache all the spectrograms. This takes a lot of storage: tens to hundreds of GB.

python3 convert_to_spec.py

Train the model - interruptible

python3 train.py --bs 20 --workers 4 --cuda True
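
The command above passes the CUDA switch as the string True. How train.py parses it is not shown here; the sketch below is a hypothetical parser for the same three flags, included mainly to illustrate a common pitfall: with argparse, type=bool would turn the string "False" into True, so an explicit converter is safer. The defaults and help texts are assumptions.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings; plain `type=bool` would treat 'False' as True."""
    text = value.strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Hypothetical parser mirroring the flags used in the command above."""
    parser = argparse.ArgumentParser(description="Sketch of training options")
    parser.add_argument("--bs", type=int, default=2, help="batch size")
    parser.add_argument("--workers", type=int, default=0, help="data-loading workers")
    parser.add_argument("--cuda", type=str2bool, default=False, help="train on GPU")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--bs", "20", "--workers", "4", "--cuda", "True"])
    print(args)
```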

Results

Unfortunately, we could not train on a bigger dataset.

Example Prediction after 37 epochs (Suffering from overfitting)

[Figure: validation spectrogram]

Loss Plot

[Figure: loss plot]

SNR Plot

[Figure: SNR plot]

References

  1. Looking to Listen at a Cocktail Party: https://arxiv.org/abs/1804.03619
  2. Discriminative Loss: https://arxiv.org/abs/1502.04149
  3. PyTorch: https://pytorch.org
  4. Catalyst: https://github.com/catalyst-team/catalyst
  5. mir_eval: https://github.com/craffel/mir_eval
  6. pysndfx: https://github.com/carlthome/python-audio-effects/tree/master/pysndfx

speech-separation's People

Contributors

dependabot[bot], rajkhandor, vinitss, vitrioil

speech-separation's Issues

Reshaping issue

Hi,
When you do this op
[image]
you are making a mistake.
You are doing input.view(B, -1, 298, 1), and that is not correct.
In PyTorch, view fills the new shape in row-major order, so the rightmost dimension varies fastest.
You have to do a permutation first,
permute(0, 2, 1, 3)
and then the reshaping,
reshape(B, 298, -1)
(reshape rather than view, because the permuted tensor is no longer contiguous).
This way, for an input of shape (B, 8, 298, 257), the values map as
[0, 0, 0, 0:257] --> [0, 0, 0:257]
[0, 1, 0, 0:257] --> [0, 0, 257:257*2]
and so on.
What you are doing instead mixes the data. With an input of shape (B, 8, 298, 257), you are putting
[0, 0, 0, 0:257] --> [0, 0, 0:257]
[0, 0, 1, 0:257] --> [0, 0, 257:298], then [0, 1, 0:216]
and so filling the reshaped tensor incorrectly.
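
The index arithmetic behind this report can be checked without running the model. The sketch below is pure Python and only mirrors the row-major layout that view relies on; it drops the batch dimension and the trailing size-1 dimension, and the function names are hypothetical. It assumes the (channels, time, frequency) = (8, 298, 257) layout quoted in the issue.

```python
def flat_offset(c, t, f, T, F):
    """Row-major offset of element [c, t, f] in a contiguous (C, T, F) tensor."""
    return (c * T + t) * F + f


def naive_view(c, t, f, T, F):
    """Destination (row, col) of [c, t, f] after view(-1, T) on the contiguous tensor."""
    return divmod(flat_offset(c, t, f, T, F), T)


def permute_reshape(c, t, f, F):
    """Destination (time, feature) of [c, t, f] after permute(1, 0, 2) and reshape(T, -1)."""
    return t, c * F + f


if __name__ == "__main__":
    T, F = 298, 257  # time frames and frequency bins from the issue
    # Naive view: the second time frame of channel 0 is split across two rows.
    print(naive_view(0, 1, 0, T, F), naive_view(0, 1, 41, T, F))
    # Permute + reshape: channel 1 of frame 0 follows channel 0 within the same frame.
    print(permute_reshape(1, 0, 0, F))
```

With the naive view, frequency bins of one time frame end up spread across rows that are supposed to index time, whereas permuting first keeps every time frame in its own row with the 8 x 257 channel and frequency values laid out side by side.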

Have you overcome the overfitting problem?

Hi there,
@vitrioil Just want to ask: have you overcome the overfitting problem that you reported in the README?
Do you have any idea what causes the overfitting, and how to overcome it?
How much data did you train on? Thanks.
