By Zhi-Song Liu, Robin Courant and Vicky Kalogeiton
We present FunnyNet-W, a versatile and efficient framework for funny moment detection in videos.
Project Page | Paper | Data
Requirements: Python 3.8, OpenCV, PyTorch 1.12.0, CUDA 11.3
- Clone the code to your local machine.
git clone https://github.com/Holmes-Alan/FunnyNet-W.git
cd FunnyNet-W
- Create working environment.
conda create --name funnynet -y python=3.8
conda activate funnynet
- Install the dependencies.
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
- Run the setup script to install all the dependencies.
./setup.sh
- Download the Friends data:
gdown https://drive.google.com/drive/folders/1ZM6agmEnheiyP0IIrD3Fc7DOubjyu5eO -O ./data --folder
Note: label files are structured as follows: [season, episode, funny-label, start, end]
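As a quick sanity check, the label files can be read with a few lines of Python. This is a minimal sketch, assuming each label pickle holds a list of [season, episode, funny-label, start, end] entries as described above, with funny-label == 1 marking a funny moment (the exact encoding of the label value is an assumption); `load_funny_segments` is a hypothetical helper, not part of the released code.

```python
import pickle

def load_funny_segments(label_path):
    """Return (season, episode, start, end) tuples for funny segments.

    Assumes each entry is [season, episode, funny_label, start, end]
    and that funny_label == 1 marks a funny moment (an assumption).
    """
    with open(label_path, "rb") as f:
        labels = pickle.load(f)
    return [(season, ep, start, end)
            for season, ep, lab, start, end in labels
            if lab == 1]
```

For example, calling it on a label file for one show returns only the time spans annotated as funny, ready to be matched against the 8-second windows produced below.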
The dataset directory is organized as follows:
FunnyNet-data/
└── tv_show_name/
├── audio/
│ ├── diff/ # `.wav` files with stereo channel difference
│ ├── embedding/ # `.pt` files with audio embedding vectors
│ ├── laughter/ # `.pickle` files with laughter timecodes
│ ├── laughter_segment/ # `.wav` files with detected laughters
│ ├── left/ # `.wav` files with the surround left channel
│ └── raw/ # `.wav` files with extracted raw audio from videos
├── laughter/ # `.pk` files with laughter labels
├── sub/ # `.pk` files with subtitles
├── episode/ # `.mkv` files with videos
├── audio_split/          # `.wav` files with 8-second audio windows
│ ├── test_8s/
│ ├── train_8s/
│ └── validation_8s/
├── video_split/          # `.mp4` files with 8-second video windows
│ ├── test_8s/
│ ├── train_8s/
│ └── validation_8s/
├── sub_split/            # `.pk` files with 8-second subtitle windows
│   ├── sub_test_8s.pk
│   ├── sub_train_8s.pk
│   └── sub_validation_8s.pk
└── automatic_sub_split/  # `.pk` files with 8-second automatic subtitle windows
├── sub_test_8s.pk
├── sub_train_8s.pk
└── sub_validation_8s.pk
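The `audio/diff` folder above stores the stereo channel difference. A common motivation for this (an assumption about the TV mixes, not stated in the repo) is that subtracting the right channel from the left suppresses center-panned content such as dialogue, which can leave the laugh track more audible. A minimal sketch of how such files could be produced, assuming 16-bit PCM stereo input; `stereo_diff` is a hypothetical helper name:

```python
import wave
import numpy as np

def stereo_diff(in_path, out_path):
    """Write a mono wav containing the left - right channel difference.

    Assumes 16-bit PCM stereo input. Samples are widened to int32 before
    subtracting, then clipped back to the int16 range to avoid overflow.
    """
    with wave.open(in_path, "rb") as w:
        assert w.getnchannels() == 2, "expects a stereo input"
        params = w.getparams()
        frames = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    # Interleaved stereo: even indices are left, odd indices are right.
    left = frames[0::2].astype(np.int32)
    right = frames[1::2].astype(np.int32)
    diff = np.clip(left - right, -32768, 32767).astype(np.int16)
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(params.sampwidth)
        w.setframerate(params.framerate)
        w.writeframes(diff.tobytes())
```

The widen-then-clip step matters: subtracting two int16 channels directly can wrap around silently.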
Note: we cannot provide the audio and video data due to copyright restrictions.
Split audio, subtitles and videos into segments of n seconds (8 by default), and use Whisper to generate automatic subtitles from audio in the wild:
python data_processing/mask_audio.py DATA_DIR/audio/raw DATA_DIR/audio/laughter DATA_DIR/audio/processed
python data_processing/audio_processing.py DATA_DIR/audio/processed DATA_DIR/laughter/xx.pk DATA_DIR/audio_split
python data_processing/sub_processing.py DATA_DIR/sub DATA_DIR/laughter/xx.pk DATA_DIR/sub_split
python data_processing/video_processing.py DATA_DIR/episode DATA_DIR/laughter/xx.pk DATA_DIR/video_split
python data_processing/whisper_extractor.py DATA_DIR/audio_split DATA_DIR/laughter/xx.pk DATA_DIR/automatic_sub_split
- Train the multimodal model with audio, vision and subtitles:
python main_audio+vision+sub_videomae_llama.py friends_path llama2_pts_path
- Download the pretrained model from this link and put it under "./models".
- Test the multimodal model with audio, vision and subtitles:
python eval_audio+vision+sub_videomae_llama.py friends_path llama2_pts_path --model_file models/audio+vision+sub_videomae_llama_whisper.pth
Please also see our previous work, FunnyNet.
If you find our work useful, please consider citing our papers:
@InProceedings{funnynet,
author = {Liu, Zhi-Song and Courant, Robin and Kalogeiton, Vicky},
title = {FunnyNet: Audiovisual Learning of Funny Moments in Videos},
booktitle = {Asian Conference on Computer Vision (ACCV)},
year = {2023},
pages={433-450},
doi = {10.1007/978-3-031-26316-3_26}
}
@Article{funnynet-w,
author = {Liu, Zhi-Song and Courant, Robin and Kalogeiton, Vicky},
title = {FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild},
journal = {International Journal of Computer Vision},
year = {2024},
pages={},
doi={}
}