
Motion-Appearance Synergistic Networks for VideoQA (MASN)

PyTorch implementation for the paper:

Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, and Byoung-Tak Zhang
In ACL 2021

Requirements

Python 3.7, PyTorch 1.2.0

Dataset

Extract Features

  1. Appearance Features
  • For local features, we used Faster R-CNN pre-trained on Visual Genome. Please see this Link.
    • After extracting object features with Faster R-CNN, you can convert them to an HDF5 file with a simple run: python adaptive_detection_features_converter.py
  • For global features, we used ResNet152 provided by torchvision. Please see this Link. (A sketch of this step follows this list.)
  2. Motion Features
  • For local features, we use RoIAlign with the bounding boxes obtained from Faster R-CNN. Please see this Link. (A RoIAlign sketch also follows this list.)
  • For global features, we use I3D pre-trained on Kinetics. Please see this Link.
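
As a rough sketch of the global appearance step above, the snippet below extracts per-frame ResNet152 feature maps with torchvision and saves them to HDF5. The file name, dataset key, and frame preprocessing are illustrative assumptions, not the exact pipeline used in this repository:

```python
import h5py
import torch
import torchvision.models as models

# Assumed input: `frames` is a [T, 3, 224, 224] tensor of normalized frames.
frames = torch.randn(16, 3, 224, 224)  # placeholder clip of T=16 frames

# ResNet152 with the final avgpool/fc removed, keeping the 7x7 feature grid.
resnet = models.resnet152(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

with torch.no_grad():
    feat = backbone(frames)  # [T, 2048, 7, 7]

# Store the features in HDF5 (file name and key are hypothetical).
with h5py.File('appearance_features.h5', 'w') as f:
    f.create_dataset('resnet_features', data=feat.cpu().numpy())
```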

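For the local motion features, a minimal RoIAlign sketch with `torchvision.ops.roi_align` is shown below; the feature-map stride, box format, and output size are assumptions, since the exact settings are not documented here:

```python
import torch
from torchvision.ops import roi_align

# Assumed inputs (shapes are illustrative):
#   motion_map: [T, C, H, W] per-frame motion feature maps (e.g., from I3D)
#   boxes: list of T tensors, each [K, 4] in (x1, y1, x2, y2) image coordinates
motion_map = torch.randn(16, 1024, 14, 14)
boxes = [torch.tensor([[10.0, 20.0, 100.0, 120.0]]) for _ in range(16)]

# spatial_scale maps image coordinates onto the feature map;
# 1/16 assumes a stride-16 backbone (adjust to the actual network).
object_feats = roi_align(motion_map, boxes, output_size=(7, 7),
                         spatial_scale=1.0 / 16)
# object_feats: [total boxes across all frames, 1024, 7, 7]
```
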
We uploaded our extracted features:

  1. TGIF-QA
  2. MSRVTT-QA
  3. MSVD-QA

Training

For TGIF-QA, run

CUDA_VISIBLE_DEVICES=0 python main.py --task Count --batch_size 32

For MSRVTT-QA, run

CUDA_VISIBLE_DEVICES=0 python main_msrvtt.py --task MS-QA --batch_size 32

For MSVD-QA, run

CUDA_VISIBLE_DEVICES=0 python main_msvd.py --task MS-QA --batch_size 32

Saving model checkpoints

By default, the model saves a checkpoint at every epoch. You can change the path for saving models with the --save_path option. By default, each checkpoint is named '[TASK]_[PERFORMANCE].pth'.
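
A hypothetical sketch of this naming convention (the function and variable names are illustrative, not the repository's actual code):

```python
import os
import torch

def save_checkpoint(model, task, performance, save_path='./saved_models'):
    """Save a checkpoint named '[TASK]_[PERFORMANCE].pth' (illustrative)."""
    os.makedirs(save_path, exist_ok=True)
    name = f'{task}_{performance:.2f}.pth'
    torch.save(model.state_dict(), os.path.join(save_path, name))
```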

Evaluation & Results

CUDA_VISIBLE_DEVICES=0 python main.py --test --checkpoint [NAME] --task Count --batch_size 32
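
Loading a checkpoint for evaluation follows the usual PyTorch pattern; this is a sketch with a placeholder model standing in for MASN:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)  # placeholder; replace with the actual MASN model
torch.save(model.state_dict(), 'Count_3.75.pth')  # stand-in checkpoint file

# Load the checkpoint and switch to evaluation mode.
state_dict = torch.load('Count_3.75.pth', map_location='cpu')
model.load_state_dict(state_dict)
model.eval()  # disable dropout / freeze batch-norm statistics
```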

Performance on the TGIF-QA dataset:

| Model | Count | Action | Trans. | FrameQA |
|-------|-------|--------|--------|---------|
| MASN  | 3.75  | 84.4   | 87.4   | 59.5    |

You can download our pre-trained models via these links: Count, Action, Trans., FrameQA

Performance on the MSRVTT-QA and MSVD-QA datasets:

| Model | MSRVTT-QA | MSVD-QA |
|-------|-----------|---------|
| MASN  | 35.2      | 38.0    |

Citation

If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:

@inproceedings{seo-etal-2021-attend,
    title = "Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering",
    author = "Seo, Ahjeong  and
      Kang, Gi-Cheon  and
      Park, Joonhan  and
      Zhang, Byoung-Tak",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.481",
    doi = "10.18653/v1/2021.acl-long.481",
    pages = "6167--6177",
    abstract = "Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understand the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question{'}s intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.",
}

License

MIT License

Acknowledgements

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (2015-0-00310-SW.StarLab/25%, 2017-0-01772-VTT/25%, 2018-0-00622-RMI/25%, 2019-0-01371-BabyMind/25%) grant funded by the Korean government.


Issues

Why is the input dimension of bbox 6?

Thank you for sharing such a great codebase!
But I am puzzled: a detection box is usually represented with 4 dimensions, (x, y, w, h) or (x1, y1, x2, y2).
Why is the input dimension of bbox_location_encoding = nn.Linear(6, 64) equal to 6?
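
For context, one common way a 6-dimensional box encoding arises (this is a guess about the design, not confirmed by the repository) is to concatenate the normalized corner coordinates with the normalized width and height:

```python
import torch
import torch.nn as nn

def encode_bbox(box, img_w, img_h):
    """Hypothetical 6-d location encoding: normalized corners plus width/height."""
    x1, y1, x2, y2 = box
    return torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                         (x2 - x1) / img_w, (y2 - y1) / img_h])

bbox_location_encoding = nn.Linear(6, 64)
loc = encode_bbox((10.0, 20.0, 100.0, 120.0), img_w=320, img_h=240)
embedded = bbox_location_encoding(loc)  # shape [64]
```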

How is the local motion feature extracted exactly?

I have looked into the local motion features and found that most of the object motion features are zeros. Additionally, the features that are not zero have the same (small) value in each dimension, which seems unlikely for C3D features pooled with RoIAlign. So how exactly are the local motion features extracted?

about extract 2d global feature with resnet152 as backbone

Thanks for your excellent work. I have a question about extracting the 2D appearance features. When using ResNet152 as the backbone, the output of layer4 (before average pooling) is [frames, 2048, 7, 7], where frames is the length of the clip. After stacking clips, I get [T, len, 2048, 7, 7].
Can you share how you process the ResNet152 output to get the appearance features with dimension T×d, as claimed in the paper?
Thanks very much.
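
A common way to reduce a [T, 2048, 7, 7] stack to the T×d shape described in the paper is spatial average pooling; whether this matches the authors' exact procedure is an assumption:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(16, 2048, 7, 7)  # per-frame layer4 output (T=16 frames)

# Collapse the 7x7 spatial grid, leaving one d=2048 vector per frame.
global_feat = F.adaptive_avg_pool2d(feat, 1).flatten(1)
print(global_feat.shape)  # torch.Size([16, 2048]) -> T x d
```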

Mismatch between detected bounding boxes and sampled frames

Hi, I sampled the GIFs using the process described in your paper, and then I visualized the bounding boxes provided in "tgif_btup_f_obj10.hdf5". But they do not match: the order of frames in a GIF does not correspond to the order of boxes in the HDF5 file.

The code for the frame extraction

Thank you for sharing such a great codebase! Could you share the code for extracting frames from the videos, or share the extracted frames directly? Thank you!
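
While this repository does not ship frame-extraction code, a typical approach with OpenCV looks like the sketch below; the sampling rate and output layout are assumptions:

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n=1):
    """Dump every n-th frame of a video/GIF to out_dir as JPEGs (illustrative)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f'{saved:05d}.jpg'), frame)
            saved += 1
        idx += 1
    cap.release()
```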
