Git Product home page Git Product logo

flipped-vqa-siq's Introduction

Large Language Models are Temporal and Causal Reasoners for Video Question Answering

This is the official implementation of Flipped-VQA (EMNLP 2023) (arxiv) (demo).

Dohwan Ko1*, Ji Soo Lee1*, Wooyoung Kang2, Byungseok Roh2, Hyunwoo J. Kim1.

1Department of Computer Science and Engineering, Korea University 2Kakao Brain

PWC PWC PWC PWC PWC

Setup

To install requirements, run:

git clone https://github.com/mlvlab/Flipped-VQA.git
cd Flipped-VQA
mkdir pretrained
mkdir data
conda create -n flipped-vqa python=3.8
conda activate flipped-vqa
sh setup.sh

Dataset & LLaMA Preparation

You can download our preprocessed datasets (NExT-QA, STAR, DramaQA, VLEP and TVQA) at here. Put them in ./data. Also, you can download original LLaMA at here, and put the checkpoint in ./pretrained.

./pretrained
   └─ llama
       |─ 7B
       |   |─ consolidated.00.pth
       |   └─ params.json
       |─ 13B
       |   :
       |─ 33B
       |   :
       └─ tokenizer.model
./data
   |─ nextqa
   |   |─ train.csv
   |   |─ val.csv
   |   └─ clipvitl14.pth
   |─ star
   |   :
   |─ dramaqa
   |   :
   |─ vlep
   |   :
   └─ tvqa
       :

Training LLaMA-VQA (LLaMA + Flipped-VQA)

NExT-QA

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3.5 --tau 100. --max_feats 10 --dataset nextqa \
--blr 9e-2 --weight_decay 0.14 --output_dir ./checkpoint/nextqa --accum_iter 2 --vaq --qav

STAR

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset star \
--blr 9e-2 --weight_decay 0.16 --output_dir ./checkpoint/star --accum_iter 1 --vaq --qav

DramaQA

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 384 --batch_size 2 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset dramaqa \
--blr 9e-2 --weight_decay 0.10 --output_dir ./checkpoint/dramaqa --accum_iter 8 --vaq --qav

VLEP

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 256 --batch_size 4 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset vlep \
--blr 6e-2 --weight_decay 0.20 --output_dir ./checkpoint/vlep --accum_iter 8 --sub --qav

TVQA

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav

We also provide fine-tuned checkpoints on each dataset at here.

Evaluation

From the training command, simply replace train.py to eval.py and add --resume ./your/checkpoint.pth.

Acknowledgements

This repo is built upon LLaMA-Adapter.

Citations

@inproceedings{ko2023large,
  title={Large Language Models are Temporal and Causal Reasoners for Video Question Answering},
  author={Ko, Dohwan and Lee, Ji Soo and Kang, Wooyoung and Roh, Byungseok and Kim, Hyunwoo J},
  booktitle={EMNLP},
  year={2023}
}

flipped-vqa-siq's People

Contributors

ikodoh avatar iquibalamhm avatar

Watchers

 avatar

Forkers

sabujar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.