Hybrid Reasoning Network for Video-based Commonsense Captioning

Introduction

This repository contains the source code for our ACM MM 2021 paper, Hybrid Reasoning Network for Video-based Commonsense Captioning.

This repo should be ready to replicate the results from the paper. If you run into any issues setting it up, please file a GitHub issue. The paper is currently only an arXiv version, so there may be further updates in the future.

Background

This repository addresses the new task of video-based commonsense captioning, which aims to generate event-wise captions while also providing multiple commonsense descriptions (e.g., attribute, effect and intention) of the underlying event in the video.

Dataset

The V2C dataset is provided in V2C_annotations.zip, which consists of:

V2C_annotations.zip
├── msrvtt_new_info.json                      # MSR-VTT captions and token dictionary.
├── v2c_info.json                             # V2C Raw, captions/CMS, and token dictionary.
├── V2C_MSR-VTT_caption.json                  # V2C Raw, captions/CMS after tokenization.
├── train_cvpr_humanRank_V2C_caption.json     # a human re-verified clean split for V2C annotations.
└── v2cqa_v1_train.json                       # for V2C QA, consisting of captions, CMS, and CMS-related questions/answers.
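
The internal layout of these JSON files is not documented here, so it can help to look at their top-level structure before wiring them into training. The following is a minimal sketch using only the Python standard library; the function name `summarize_annotations` is hypothetical and not part of this repository, and it makes no assumption about the keys inside each file.

```python
import json
import zipfile


def summarize_annotations(zip_path):
    """Describe the top-level structure of every JSON file in the archive.

    Returns a dict mapping each member name to a (kind, detail) tuple:
    ("dict", first ten sorted keys), ("list", length), or (type name, None).
    """
    summary = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                data = json.load(handle)
            if isinstance(data, dict):
                summary[name] = ("dict", sorted(data.keys())[:10])
            elif isinstance(data, list):
                summary[name] = ("list", len(data))
            else:
                summary[name] = (type(data).__name__, None)
    return summary


# Example usage (assumes the archive sits in the current directory):
# for name, (kind, detail) in summarize_annotations("V2C_annotations.zip").items():
#     print(name, kind, detail)
```

Each file is read directly from the archive, so nothing has to be unpacked just to check which keys or how many entries it contains.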

Video Features

We use pre-trained ResNet152, SoundNet and I3D models to extract the appearance, audio and motion features, respectively. The video feature data can be obtained from the link.

Training and Evaluation

  • Environment: This implementation was developed on PyTorch 1.8.1. Errors have been reported when a newer PyTorch version is used, and we will work on an update for that later.
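
Since errors have been reported with newer PyTorch releases, a quick version check before training can save some debugging time. The following is a minimal sketch, not part of this repository; the helper names `parse_version` and `check_torch_version` are hypothetical, and the comparison only looks at the major, minor and patch numbers.

```python
import re

# The version this implementation was developed on, per the note above.
TESTED_VERSION = (1, 8, 1)


def parse_version(version):
    """Turn a string such as '1.8.1+cu111' into a tuple such as (1, 8, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch number (e.g. '1.10') is treated as 0.
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def check_torch_version(version):
    """Return a warning message if `version` is newer than the tested one, else None."""
    if parse_version(version) > TESTED_VERSION:
        return (
            f"PyTorch {version} is newer than the tested 1.8.1; "
            "errors have been reported with newer releases."
        )
    return None


# Example usage (requires PyTorch to be installed):
# import torch
# message = check_torch_version(torch.__version__)
# if message:
#     print(message)
```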

E.g., to start training on the intention prediction task (set --cms 'int') with 1 RNN video encoder layer and 6 transformer decoder layers with 8 attention heads, a head dim of 64, and a model dim of 1024, for 600 epochs in CUDA mode, while showing intermediate generation examples:

python train.py --cms int --batch_size 64 --epochs 600 --num_layer 6 --dim_head 64 --dim_inner 1024 \
                --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save \
                --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json \
                --print_loss_every 20 --cuda --show_predict   

For completion evaluations:

python test_cms.py  --cms int --batch_size 64 --num_layer 6 --dim_head 64 --dim_inner 1024 \
                    --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save  \
                    --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json  \
                    --load_checkpoint save/**.pth --cuda

For generation evaluations:

python test_cap2cms.py  --cms int --batch_size 64 --num_layer 6 --dim_head 64 --dim_inner 1024 \
                        --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save  \
                        --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json \
                        --load_checkpoint save/*.pth --cuda
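
The three commands above repeat the same architecture and data flags, and the evaluation scripts presumably need the same values that were used for training in order to load a checkpoint. One way to keep them in sync is to build the commands from a single list. This is a sketch only: `SHARED_FLAGS`, `build_command` and `eval_command` are hypothetical helpers, not part of this repository, and it assumes the scripts accept their flags in any order. The checkpoint path is left for you to supply.

```python
import shlex

# Flags that appear identically in the train and both evaluation commands.
SHARED_FLAGS = [
    "--cms", "int",
    "--batch_size", "64",
    "--num_layer", "6",
    "--dim_head", "64",
    "--dim_inner", "1024",
    "--num_head", "8",
    "--dim_vis_feat", "2048",
    "--dropout", "0.1",
    "--rnn_layer", "1",
    "--checkpoint_path", "./save",
    "--info_json", "data/v2c_info.json",
    "--caption_json", "data/V2C_MSR-VTT_caption.json",
]


def build_command(script, extra_flags):
    """Return the argument list for `script` with the shared flags plus extras."""
    return ["python", script, *SHARED_FLAGS, *extra_flags]


def eval_command(script, checkpoint):
    """Return an evaluation command that loads the given checkpoint file."""
    return build_command(script, ["--load_checkpoint", checkpoint, "--cuda"])


train_cmd = build_command(
    "train.py",
    ["--epochs", "600", "--print_loss_every", "20", "--cuda", "--show_predict"],
)

# Print the training command as a copy-pasteable shell line.
print(shlex.join(train_cmd))

# Example usage for evaluation (replace the path with your own checkpoint):
# import subprocess
# subprocess.run(eval_command("test_cms.py", "save/<your_checkpoint>.pth"), check=True)
```

Changing a hyperparameter in `SHARED_FLAGS` then changes it for training and for both evaluation scripts at once.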

Model Zoo

Download MODEL_ZOO.zip for our trained captioning models for intention, effect and attribute generation.

Citations

Please consider citing this paper if you find it helpful:

@inproceedings{yu2021hybrid,
      title={Hybrid Reasoning Network for Video-based Commonsense Captioning}, 
      author={Weijiang Yu and Jian Liang and Lei Ji and Lu Li and Yuejian Fang and Nong Xiao and Nan Duan},
      booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
      year={2021}
}
