By Zhi-Song Liu, Robin Courant and Vicky Kalogeiton
We present FunnyNet-W, a versatile and efficient framework for funny moment detection in videos.
Project Page | Paper | Data
Requirements: Python 3.8, OpenCV, PyTorch 1.12.0, CUDA 11.3
- Clone the code to your local machine.
git clone https://github.com/Holmes-Alan/FunnyNet-W.git
cd FunnyNet-W
- Create working environment.
conda create --name funnynet -y python=3.8
conda activate funnynet
- Install the dependencies.
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
- Run the setup script to install all the dependencies.
./setup.sh
- Download the Friends data:
gdown https://drive.google.com/drive/folders/1ZM6agmEnheiyP0IIrD3Fc7DOubjyu5eO -O ./data --folder
Note: label files are structured as follows: [season, episode, funny-label, start, end]
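As a quick sanity check, the label files can be read with a few lines of Python. This is a minimal sketch, assuming each label pickle holds a list of [season, episode, funny-label, start, end] entries as described above, with funny-label == 1 marking a funny moment (the exact encoding of the label value is an assumption); `load_funny_segments` is a hypothetical helper, not part of the released code.

```python
import pickle

def load_funny_segments(label_path):
    """Return (season, episode, start, end) tuples for funny segments.

    Assumes each entry is [season, episode, funny_label, start, end]
    and that funny_label == 1 marks a funny moment (an assumption).
    """
    with open(label_path, "rb") as f:
        labels = pickle.load(f)
    return [(season, ep, start, end)
            for season, ep, lab, start, end in labels
            if lab == 1]
```

For example, calling it on a label file for one show returns only the time spans annotated as funny, ready to be matched against the 8-second windows produced below.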
The dataset directory is organized as follows:
FunnyNet-data/
└── tv_show_name/
├── audio/
│ ├── diff/ # `.wav` files with stereo channel difference
│ ├── embedding/ # `.pt` files with audio embedding vectors
│ ├── laughter/ # `.pickle` files with laughter timecodes
│ ├── laughter_segment/ # `.wav` files with detected laughters
│ ├── left/ # `.wav` files with the surround left channel
│ └── raw/ # `.wav` files with extracted raw audio from videos
├── laughter/ # `.pk` files with laughter labels
├── sub/ # `.pk` files with subtitles
├── episode/ # `.mkv` files with videos
├── audio_split/          # `.wav` files with 8-second audio windows
│ ├── test_8s/
│ ├── train_8s/
│ └── validation_8s/
├── video_split/          # `.mp4` files with 8-second video windows
│ ├── test_8s/
│ ├── train_8s/
│ └── validation_8s/
├── sub_split/            # `.pk` files with 8-second subtitle windows
│   ├── sub_test_8s.pk
│   ├── sub_train_8s.pk
│   └── sub_validation_8s.pk
└── automatic_sub_split/  # `.pk` files with 8-second automatic subtitle windows
├── sub_test_8s.pk
├── sub_train_8s.pk
└── sub_validation_8s.pk
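The `audio/diff` folder above stores the stereo channel difference. A common motivation for this (an assumption about the TV mixes, not stated in the repo) is that subtracting the right channel from the left suppresses center-panned content such as dialogue, which can leave the laugh track more audible. A minimal sketch of how such files could be produced, assuming 16-bit PCM stereo input; `stereo_diff` is a hypothetical helper name:

```python
import wave
import numpy as np

def stereo_diff(in_path, out_path):
    """Write a mono wav containing the left - right channel difference.

    Assumes 16-bit PCM stereo input. Samples are widened to int32 before
    subtracting, then clipped back to the int16 range to avoid overflow.
    """
    with wave.open(in_path, "rb") as w:
        assert w.getnchannels() == 2, "expects a stereo input"
        params = w.getparams()
        frames = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    # Interleaved stereo: even indices are left, odd indices are right.
    left = frames[0::2].astype(np.int32)
    right = frames[1::2].astype(np.int32)
    diff = np.clip(left - right, -32768, 32767).astype(np.int16)
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(params.sampwidth)
        w.setframerate(params.framerate)
        w.writeframes(diff.tobytes())
```

The widen-then-clip step matters: subtracting two int16 channels directly can wrap around silently.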
Note: we cannot provide the audio and video data due to copyright restrictions.
Split audio, subtitles and videos into segments of n seconds (8 by default), and use Whisper to generate automatic subtitles from audio in the wild:
python data_processing/mask_audio.py DATA_DIR/audio/raw DATA_DIR/audio/laughter DATA_DIR/audio/processed
python data_processing/audio_processing.py DATA_DIR/audio/processed DATA_DIR/laughter/xx.pk DATA_DIR/audio_split
python data_processing/sub_processing.py DATA_DIR/sub DATA_DIR/laughter/xx.pk DATA_DIR/sub_split
python data_processing/video_processing.py DATA_DIR/episode DATA_DIR/laughter/xx.pk DATA_DIR/video_split
python data_processing/whisper_extractor.py DATA_DIR/audio_split DATA_DIR/laughter/xx.pk DATA_DIR/automatic_sub_split
- Train the multimodal model with audio, vision and subtitles:
python main_audio+vision+sub_videomae_llama.py friends_path llama2_pts_path
- Download the pretrained model from this link and put it under "./models".
- Test the multimodal model with audio, vision and subtitles:
python eval_audio+vision+sub_videomae_llama.py friends_path llama2_pts_path --model_file models/audio+vision+sub_videomae_llama_whisper.pth
Please also see our previous work, FunnyNet.
If you find our work useful, please consider citing our papers:
@InProceedings{funnynet,
author = {Liu, Zhi-Song and Courant, Robin and Kalogeiton, Vicky},
title = {FunnyNet: Audiovisual Learning of Funny Moments in Videos},
booktitle = {Asian Conference on Computer Vision (ACCV)},
year = {2023},
pages={433-450},
doi = {10.1007/978-3-031-26316-3_26}
}
@Article{funnynet-w,
author = {Liu, Zhi-Song and Courant, Robin and Kalogeiton, Vicky},
title = {FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild},
journal = {International Journal of Computer Vision},
year = {2024},
pages={},
doi={}
}