Generating Visually Aligned Sound from Videos

This is the official PyTorch implementation of the TIP paper "Generating Visually Aligned Sound from Videos" and the corresponding Visually Aligned Sound (VAS) dataset.

Demo videos containing sound generation results can be found here.

Usage Guide

Getting Started

Installation

Clone this repository into a directory. We refer to that directory as REGNET_ROOT.

git clone https://github.com/PeihaoChen/regnet
cd regnet

Create a new Conda environment.

conda create -n regnet python=3.7.1
conda activate regnet

Install PyTorch and other dependencies.

conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0
conda install ffmpeg -n regnet -c conda-forge
pip install -r requirements.txt
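
After installation, it can help to confirm that the environment matches the versions pinned above. The following is a minimal sketch and not part of the repository; `check_environment` and `EXPECTED` are hypothetical names, and the check compares only version prefixes because a build may report a suffix such as `+cu100`.

```python
import importlib
import sys

# Versions pinned by the installation commands above.
EXPECTED = {"torch": "1.2.0", "torchvision": "0.4.0"}


def check_environment(expected=EXPECTED):
    """Return a dict mapping package name to (installed_version, matches)."""
    report = {}
    for name, wanted in expected.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not installed in the active environment.
            report[name] = (None, False)
            continue
        installed = getattr(module, "__version__", "unknown")
        # Compare prefixes so that "1.2.0+cu100" still matches "1.2.0".
        report[name] = (installed, installed.startswith(wanted))
    return report


def format_report(report):
    """Render the report as one human-readable line per package."""
    lines = ["python " + sys.version.split()[0]]
    for name, (installed, ok) in report.items():
        status = "ok" if ok else "mismatch or missing"
        lines.append(f"{name}: {installed} ({status})")
    return "\n".join(lines)
```

Running `print(format_report(check_environment()))` inside the activated `regnet` environment lists each package with its installed version and whether it matches the pinned one.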

Download Datasets

In our paper, we collect 8 sound types (Dog, Fireworks, Drum, Baby from VEGAS and Gun, Sneeze, Cough, Hammer from AudioSet) to build our Visually Aligned Sound (VAS) dataset. Please first download the VAS dataset and unzip the data into the $REGNET_ROOT/data/ folder.

For each sound type from AudioSet, we download all videos from YouTube and clean the data on Amazon Mechanical Turk (AMT) in the same way as VEGAS.

unzip ./data/VAS.zip -d ./data

Data Preprocessing

Run data_preprocess.sh to preprocess data and extract RGB and optical flow features.

Notice: The script we provide to calculate optical flow is easy to run, but it is resource-intensive and takes a long time. We strongly recommend using the TSN repository and its prebuilt Docker image (our paper also uses this solution) to speed up optical flow extraction and to strictly reproduce the results.

source data_preprocess.sh

Training REGNET

Train REGNET from scratch. The results will be saved to ckpt/dog.

CUDA_VISIBLE_DEVICES=7 python train.py \
save_dir ckpt/dog \
auxiliary_dim 64 \
rgb_feature_dir data/features/dog/feature_rgb_bninception_dim1024_21.5fps \
flow_feature_dir data/features/dog/feature_flow_bninception_dim1024_21.5fps \
mel_dir data/features/dog/melspec_10s_22050hz \
checkpoint_path ''
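
The command above trains the Dog category only. A minimal sketch for assembling the same command for each of the eight sound types follows. It assumes, without confirmation from the repository, that every category uses the same directory pattern as `dog` with a lowercase category name; `CATEGORIES` and `build_train_command` are hypothetical helpers.

```python
import shlex

# The eight VAS sound types named in this README, lowercased to match "dog".
# Assumption: feature directories for the other types follow the dog layout.
CATEGORIES = ["dog", "fireworks", "drum", "baby", "gun", "sneeze", "cough", "hammer"]


def build_train_command(category, gpu="0"):
    """Return (env, argv) for training one category, mirroring the dog example."""
    feature_root = f"data/features/{category}"
    argv = [
        "python", "train.py",
        "save_dir", f"ckpt/{category}",
        "auxiliary_dim", "64",
        "rgb_feature_dir", f"{feature_root}/feature_rgb_bninception_dim1024_21.5fps",
        "flow_feature_dir", f"{feature_root}/feature_flow_bninception_dim1024_21.5fps",
        "mel_dir", f"{feature_root}/melspec_10s_22050hz",
        # Empty checkpoint path, as in the from-scratch example above.
        "checkpoint_path", "",
    ]
    env = {"CUDA_VISIBLE_DEVICES": gpu}
    return env, argv


def as_shell_line(env, argv):
    """Render the command as a single copy-pasteable shell line."""
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    return prefix + " " + " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    for category in CATEGORIES:
        print(as_shell_line(*build_train_command(category)))
```

The script only prints the commands, so they can be reviewed before being run; the returned `argv` list can also be passed to `subprocess.run` with `env` merged into `os.environ`.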

If the program stops unexpectedly, you can resume training from a saved checkpoint.

CUDA_VISIBLE_DEVICES=7 python train.py \
-c ckpt/dog/opts.yml \
checkpoint_path ckpt/dog/checkpoint_018081
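
To resume, you need the name of the most recent checkpoint. The sketch below picks it automatically. It assumes checkpoints are stored directly in the save directory and named `checkpoint_<step>`, as the path `ckpt/dog/checkpoint_018081` suggests; `latest_checkpoint` is a hypothetical helper, not a function from the repository.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from "checkpoint_018081" in the example above.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint_(\d+)$")


def latest_checkpoint(save_dir):
    """Return the checkpoint path with the largest step number, or None."""
    best_step, best_path = -1, None
    for path in Path(save_dir).glob("checkpoint_*"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match is None:
            # Skip files that only share the prefix, e.g. "checkpoint_tmp".
            continue
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, path
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint("ckpt/dog"))
```

The returned path can be passed as the `checkpoint_path` value in the resume command; a result of `None` means no checkpoint was found in that directory.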

Generating Sound

During inference, RegNet first generates a visually aligned spectrogram and then uses WaveNet as a vocoder to generate the waveform from that spectrogram. You should first download our trained WaveNet model for the sound category you need (Dog, Fireworks, Drum, Baby, Gun, Sneeze, Cough, Hammer).

The generated spectrogram and waveform will be saved to ckpt/dog/inference_result.

CUDA_VISIBLE_DEVICES=7 python test.py \
-c ckpt/dog/opts.yml \
aux_zero True \
checkpoint_path ckpt/dog/checkpoint_041000 \
save_dir ckpt/dog/inference_result \
wavenet_path /path/to/wavenet_dog.pth
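
Inference needs three files to be in place: the saved options file, a RegNet checkpoint, and the WaveNet model for the category. The following sketch assembles the command above and reports which of these files are missing before anything is launched. It is an illustration only: `build_test_command` is a hypothetical helper, and the `wavenet_<category>.pth` file name and the `ckpt/<category>` layout are assumptions generalized from the dog example.

```python
from pathlib import Path


def build_test_command(category, checkpoint, wavenet_dir, gpu="0"):
    """Return (env, argv, missing) for running test.py on one category."""
    ckpt_dir = Path("ckpt") / category
    # Files the command depends on, keyed by a readable label.
    required = {
        "options file": ckpt_dir / "opts.yml",
        "RegNet checkpoint": ckpt_dir / checkpoint,
        # Assumed file name, following "wavenet_dog.pth" in the example above.
        "WaveNet model": Path(wavenet_dir) / f"wavenet_{category}.pth",
    }
    argv = [
        "python", "test.py",
        "-c", required["options file"].as_posix(),
        "aux_zero", "True",
        "checkpoint_path", required["RegNet checkpoint"].as_posix(),
        "save_dir", (ckpt_dir / "inference_result").as_posix(),
        "wavenet_path", required["WaveNet model"].as_posix(),
    ]
    env = {"CUDA_VISIBLE_DEVICES": gpu}
    missing = [label for label, path in required.items() if not path.exists()]
    return env, argv, missing


if __name__ == "__main__":
    env, argv, missing = build_test_command("dog", "checkpoint_041000", "wavenets")
    if missing:
        print("Missing before inference can run:", ", ".join(missing))
    else:
        print(" ".join(argv))
```

Here `wavenets` stands for whichever directory you downloaded the WaveNet models into.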

If you want to train your own WaveNet model, you can use the WaveNet vocoder repository at the commit below.

git clone https://github.com/r9y9/wavenet_vocoder && cd wavenet_vocoder
git checkout 2092a64

Enjoy your experiments!

Other Info

Citation

Please cite the following paper if you find REGNET useful in your research:

@Article{chen2020regnet,
  author  = {Peihao Chen and Yang Zhang and Mingkui Tan and Hongdong Xiao and Deng Huang and Chuang Gan},
  title   = {Generating Visually Aligned Sound from Videos},
  journal = {TIP},
  year    = {2020},
}

Contact

For any questions, please file an issue or contact:

Peihao Chen: [email protected]
Hongdong Xiao: [email protected]
