
flipped-vqa's Introduction

Large Language Models are Temporal and Causal Reasoners for Video Question Answering

This is the official implementation of Flipped-VQA (EMNLP 2023) (arXiv) (demo).

Dohwan Ko1*, Ji Soo Lee1*, Wooyoung Kang2, Byungseok Roh2, Hyunwoo J. Kim1.

1Department of Computer Science and Engineering, Korea University 2Kakao Brain


Setup

To install requirements, run:

git clone https://github.com/mlvlab/Flipped-VQA.git
cd Flipped-VQA
mkdir pretrained
mkdir data
conda create -n flipped-vqa python=3.8
conda activate flipped-vqa
sh setup.sh

Dataset & LLaMA Preparation

You can download our preprocessed datasets (NExT-QA, STAR, DramaQA, VLEP, and TVQA) here and put them in ./data. You can also download the original LLaMA weights here and put the checkpoints in ./pretrained.

./pretrained
   └─ llama
       |─ 7B
       |   |─ consolidated.00.pth
       |   └─ params.json
       |─ 13B
       |   :
       |─ 33B
       |   :
       └─ tokenizer.model
./data
   |─ nextqa
   |   |─ train.csv
   |   |─ val.csv
   |   └─ clipvitl14.pth
   |─ star
   |   :
   |─ dramaqa
   |   :
   |─ vlep
   |   :
   └─ tvqa
       :

Training LLaMA-VQA (LLaMA + Flipped-VQA)

NExT-QA

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3.5 --tau 100. --max_feats 10 --dataset nextqa \
--blr 9e-2 --weight_decay 0.14 --output_dir ./checkpoint/nextqa --accum_iter 2 --vaq --qav

STAR

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset star \
--blr 9e-2 --weight_decay 0.16 --output_dir ./checkpoint/star --accum_iter 1 --vaq --qav

DramaQA

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 384 --batch_size 2 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset dramaqa \
--blr 9e-2 --weight_decay 0.10 --output_dir ./checkpoint/dramaqa --accum_iter 8 --vaq --qav

VLEP

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 256 --batch_size 4 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset vlep \
--blr 6e-2 --weight_decay 0.20 --output_dir ./checkpoint/vlep --accum_iter 8 --sub --qav

TVQA

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --accum_iter 4 --sub --vaq --qav

We also provide fine-tuned checkpoints for each dataset here.

Evaluation

To evaluate, take the corresponding training command, replace train.py with eval.py, and add --resume ./your/checkpoint.pth.
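For example, evaluating a model trained on NExT-QA would look like the following (the checkpoint path is illustrative):

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 eval.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3.5 --tau 100. --max_feats 10 --dataset nextqa \
--blr 9e-2 --weight_decay 0.14 --output_dir ./checkpoint/nextqa --accum_iter 2 --vaq --qav \
--resume ./checkpoint/nextqa/checkpoint_best.pth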

Acknowledgements

This repo is built upon LLaMA-Adapter.

Citations

@inproceedings{ko2023large,
  title={Large Language Models are Temporal and Causal Reasoners for Video Question Answering},
  author={Ko, Dohwan and Lee, Ji Soo and Kang, Wooyoung and Roh, Byungseok and Kim, Hyunwoo J},
  booktitle={EMNLP},
  year={2023}
}


flipped-vqa's Issues

Fine-tuning using LLaMA-13B

Hello, if I want to use the LLaMA-13B checkpoint (.pth) for fine-tuning, what changes need to be made to the train.sh script? After fine-tuning with the same parameters as LLaMA-7B, the accuracy is very low.

What is the function of the parameter `max_feats`?

Thank you for the great work and open source code!

I have been exploring the codebase and documentation, and I came across the max_feats parameter. However, I couldn't find a clear explanation for its purpose. Could you please provide some insights into what role max_feats plays in the functionality of the project?
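From what I can tell so far, it seems to cap the number of per-video CLIP features: the dataloaders appear to truncate or pad each video's features to max_feats (10 in all the provided commands). A rough sketch of what I think that looks like, with illustrative names rather than the repository's actual code:

import torch

def cap_video_features(video: torch.Tensor, max_feats: int = 10):
    # video: [num_frames, feat_dim] CLIP features for one clip
    if len(video) > max_feats:
        # keep max_feats uniformly spaced frame features
        idx = torch.linspace(0, len(video) - 1, max_feats).long()
        video, video_len = video[idx], max_feats
    else:
        # pad with zeros so every sample ends up with a fixed length of max_feats
        video_len = len(video)
        pad = torch.zeros(max_feats - video_len, video.shape[-1], dtype=video.dtype)
        video = torch.cat([video, pad], dim=0)
    return video, video_len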

Thank you in advance for taking the time to address this inquiry.

about self.gate2

In llama/model's Attention class, there is:

self.gate2 = torch.nn.Parameter(torch.ones(1, self.n_local_heads, 1, 1) * -args.bias)

and in the forward pass of Attention, when computing the attention score map, there is:

vt_scores[:, :, video_start + self.max_feats:, video_start:video_start + self.max_feats] = \
    vt_scores[:, :, video_start + self.max_feats:, video_start:video_start + self.max_feats] + self.gate2.half()

I don't understand this part. What is gate2 used for?
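My current guess is that it is a learnable per-head bias, initialized to -args.bias, that gets added to the block of the score map where text/question tokens attend to the video tokens. A toy sketch of that mechanism as I understand it (shapes and names are illustrative, not the repository's code):

import torch

n_heads, max_feats, video_start, bias = 4, 10, 1, 3.0
seq_len = 32

# one learnable scalar per head, initialized to -bias, so attention to video starts suppressed
gate2 = torch.nn.Parameter(torch.ones(1, n_heads, 1, 1) * -bias)

scores = torch.randn(2, n_heads, seq_len, seq_len)   # [batch, heads, query, key]

# rows: text tokens that come after the video span; cols: the video tokens themselves
rows = slice(video_start + max_feats, None)
cols = slice(video_start, video_start + max_feats)
scores[:, :, rows, cols] = scores[:, :, rows, cols] + gate2

# gate2 is learned end-to-end, biasing how strongly text tokens attend to video tokens
attn = scores.softmax(dim=-1)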

From where to download LLaMA-v1 model?

Hi!
Thank you for the great work!

It looks like this code is based on the LLaMA-v1 models. Although you provided the link to the original repository, I applied through the form but didn't get a reply. Is there any other way to obtain the model weights?

How to extract features using CLIP VIT-L?

I used CLIP to extract video features on my own dataset, but the qav_loss did not decrease at all. When I tried the provided features on the NExT-QA dataset, the qav_loss converged within 5 epochs. I don't know why...
Could you provide the code for the feature extraction part? Thank you very much.
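For reference, this is roughly how I extracted my features, assuming the standard OpenAI CLIP package; I may well be missing a preprocessing step (frame sampling rate, normalization, ...) that you used:

import clip                      # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def extract_frame_features(frame_paths):
    # frame_paths: image files sampled (e.g., uniformly) from one video
    images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(images)   # [num_frames, 768] for ViT-L/14
    return feats.float().cpu()

# feats = {vid: extract_frame_features(frames_of(vid)) for vid in video_ids}  # frames_of is my own helper
# torch.save(feats, "clipvitl14.pth")        # I assume the provided file is a dict keyed by video id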

Concerns and Clarifications Regarding MCQ to Generation Task Conversion

Hi there!
I have been working on converting the task from the MCQ setting to a generation setting. So I have modified the data loader to remove the choices given with the input and make the output return the full answer directly.

This is a summary of my additions to support the generation task:

  • Extracted the most likely token sequence from vqa_output using torch.argmax and reshaped it to match batch and sequence length dimensions.
  • Created a mask (vqa_placeholder_mask) to identify the answer part in the sequence.
  • Implemented logic to extract answers from each choice in the batch, considering start and end tokens.
  • Encoded extracted answers to tensors, padded them for uniform length, and converted them into embeddings.
  • Aggregated the answer embeddings and reshaped them to match the batch and embedding size dimensions.
  • Filtered the output tokens based on the placeholder mask to identify relevant answer parts.
  • Processed each set of output tokens, identifying the end of answers using 'eos_id', and embedded the tokens.
  • Aggregated these embeddings along the sequence length by computing the mean along the sequence dim.
  • Calculated cosine similarity for each instance in the batch, considering the options.

However, when training the model, I noticed that the answers generated by the model are not meaningful, so the similarity computation does not work as expected.
I can see in the code that during inference the model is given all the choices, so I interpreted this as: the model receives essentially the same input multiple times, since the choice part is not included in the loss computation anyway. Based on this assumption, I extracted the first output from every batch and used it for the similarity computation. However, debugging the results showed that my assumption was incorrect, and I'm not sure which approach to take now to get the similarity right.
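Concretely, my matching step boils down to something like this (simplified, with illustrative names; the embedding step itself is a placeholder):

import torch
import torch.nn.functional as F

def pick_choice(generated_emb: torch.Tensor, choice_embs: torch.Tensor) -> int:
    # generated_emb: [dim] mean-pooled embedding of the generated answer tokens
    # choice_embs:   [num_choices, dim] mean-pooled embeddings of each answer option
    sims = F.cosine_similarity(generated_emb.unsqueeze(0), choice_embs, dim=-1)
    return int(sims.argmax().item())          # index of the most similar option

# toy example with random embeddings (5 options, 4096-dim like LLaMA-7B hidden states)
print(pick_choice(torch.randn(4096), torch.randn(5, 4096)))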

Here, I'm summarizing my questions:

  • Is the assumption that the output should be the same across the 5 options due to input masking correct, or is it more appropriate to input only one option and compare the generated answer?
  • Are there any potential issues or improvements in the way answers are extracted, encoded, and embedded?
  • Is the current method of calculating cosine similarity between the generated answer and the other options the optimal approach for this task?

Error when training with TVQA dataset: AttributeError in DataLoader worker process

Issue Description
I encountered an AttributeError when attempting to train a model using the TVQA dataset. All other datasets worked fine during the training setup.

Steps to Reproduce
Ran the following command:

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 2 train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav

Expected Behavior
The training process should have started without any issues.

Actual Behavior
The process failed with the following error message:

[19:42:30.947615] Start training for 5 epochs
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    main(args)
  File "train.py", line 129, in main
    train_stats = train_one_epoch(model, data_loader_train, optimizer, epoch, loss_scaler, args=args)
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/engine.py", line 19, in train_one_epoch
    for data_iter_step, data in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/util/misc.py", line 129, in log_every
    for obj in iterable:
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/dataloader/tvqa.py", line 173, in __getitem__
    video, video_len = self._get_video(f'{vid}', start, end)
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/dataloader/tvqa.py", line 60, in _get_video
    if len(video) > self.max_feats:
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 83, in __getattr__
    raise AttributeError
AttributeError

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14990) of binary: /home/admin-guest/anaconda3/envs/flippedvqa_env/bin/python
Traceback (most recent call last):
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-07_19:42:35
  host      : iquibalh-desktop
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 14990)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment
OS: Ubuntu 22.04.3 LTS
Python version: 3.8.18
PyTorch version: 1.10.0+cu111
GPU: 2x NVIDIA RTX A6000 (48 GB)

Additional Context
This issue occurs only with TVQA; all other datasets start training without problems.

A question about QAV task in the code

First of all, thank you very much for this work.

I observed that the video frames in the QAV task are not randomized. I believe it's essential to shuffle the frame order to prevent potential information leakage, since a fixed order could let the model predict it without truly understanding the content of the frames.

I attempted to shuffle the frame order and adjusted the corresponding labels accordingly. However, despite these changes, the loss for the task did not decrease.

These are the modifications I added in nextqa.py:

        # shuffle the video frame order for the qav task
        shuffled_frame_indices = torch.randperm(self.max_feats)  # random permutation of 0 .. self.max_feats - 1
        shuffled_video_frames = video.clone()
        shuffled_video_frames = shuffled_video_frames[shuffled_frame_indices]
        start_frame_index_label = (label['qav'][0] == 0).nonzero(as_tuple=True)[0].item()  # label['qav'] has shape [1, 128]
        label['qav'][:, start_frame_index_label:start_frame_index_label + self.max_feats] = shuffled_frame_indices
        if self.args.debug:
            print("Label QAV after shuffling video frames:", label['qav'])
        return {"vid": vid, "video": video, "shuffled_video_frames": shuffled_video_frames, "video_len": video_len,
                "text": text, "text_id": text_id, "label": label, "video_start": video_start,
                "video_index": video_index, "label_mask": label_mask, "qid": idx, "answer": answer,
                "qtype": qtype, "shuffled_frame_indices": shuffled_frame_indices}

In model.py:

        shuffled_video = data['shuffled_video_frames'].cuda()

        if self.args.qav and not inference:
            _video_feature_shuffled = self.visual_proj(shuffled_video).half()

I appreciate your clarification

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 55662) of binary: /usr/bin/python3

Hi!
I'm running into this issue and am not sure how to solve it. I get this error when running with 8 GPUs. Note that I'm using LLaMA-2 instead of LLaMA-1, and AMD GPUs, in case that makes a difference.

[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(the same warning is printed once per rank, 8 times in total)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 55659 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 55660 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 55661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 55663 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 55664 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 55665 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 55666 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 55662) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-14_13:01:15
  host      : nid005022
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 55662)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: nid005022: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=4915815.0

I appreciate your help!
Thanks

How was the STAR dataset preprocessed for this code

I've been exploring the STAR dataset as described on the official STAR website and noticed some potential differences in the dataset's structure and usage in this repository. Could you please clarify how the STAR dataset used here differs from the version detailed on the official website? Specifically, I'm interested in any modifications or preprocessing steps that were applied to the dataset for integration into Flipped-VQA.

meaning of qav loss

Good job! I can repeat the experiments and get convincing numbers. I notice that the vaq loss is really useful, which verifies the idea in the paper.
However, I want to know whether the qav loss, i.e., the task of restoring the video from the QA, is actually meaningful. I notice that the qav loss is not that useful for training a better model, and sometimes training with the qav loss even performs worse than training without it.

Checkpoints

I have trained the model on a few datasets, but since the torch seed differs, the results are slightly different from those presented in the paper. Could you share the checkpoints needed to reproduce the results shown in the paper?

Cannot reproduce the result

When I tried to reproduce the experimental result on the STAR dataset on my machine (a single RTX 4090) using the checkpoint provided on Google Drive, the accuracy was only around 25%. And when I tried to train the model with the following parameters, the loss quickly went to NaN. I'm not sure what the problem is.
python3 train.py --qav --vaq --max_seq_len 128 --batch_size 1 \
    --epochs 10 --bias 3 --tau 100. --max_feats 10 --warmup_epochs 2 \
    --dataset star --blr 9e-2 --weight_decay 0.16 --output_dir ./checkpoint/star \
    --accum_iter 8 --dataset star --model llama-2-7b --llama_model_path ./pretrained/ \
    --num_workers 8
The batch_size is set to 1 because of VRAM limitations.

Number of frames and its use in the code, and max_feats=10 for video features

From the paper we know that 10 frames from each video are used to generate the feature file. Is that the only place where the number of frames matters, or is the number of frames used again later in the code?
As I see it in the code, we only select 10 tensors (controlled by the max_feats argument) from the CLIP-ViT feature file. Is max_feats set to 10 because of the 10 frames, or is it an independent decision?

Not getting the reported number.

Hi,

Thank you for the great work; I am planning to try this for my research.
I've written my own evaluation code since it was not provided by the authors.

import os
import argparse
import datetime
import json
import time
import numpy as np
from pathlib import Path

import torch
import torch.backends.cudnn as cudnn

import timm
import timm.optim.optim_factory as optim_factory

import util.misc as misc
from util.misc import NativeScalerWithGradNormCount as NativeScaler
from engine import train_one_epoch, val_one_epoch
from llama import Tokenizer
from llama_vqa import LLaMA_VQA
from dataloader import load_data

def get_args_parser():
    parser = argparse.ArgumentParser('MAE pre-training', add_help=False)
    parser.add_argument('--batch_size', default=64, type=int, help='Batch size per GPU (effective batch size is batch_size * accum_iter * # gpus)')
    parser.add_argument('--epochs', default=400, type=int)
    parser.add_argument('--accum_iter', default=1, type=int, help='Accumulate gradient iterations (for increasing the effective batch size under memory constraints)')

    # Model parameters
    parser.add_argument('--llama_model_path', default='./pretrained/llama/', type=str, help='path of llama model')
    parser.add_argument('--model', default='7B', type=str, metavar='MODEL', help='Name of model to train')
    parser.add_argument('--adapter_layer', type=int, default=32, metavar='LENGTH', help='the number of adapter layer')
    parser.add_argument('--adapter_len', type=int, default=10, metavar='LENGTH', help='the adapter length')
    parser.add_argument('--max_seq_len', type=int, default=512, metavar='LENGTH', help='the maximum sequence length')
    parser.add_argument('--max_feats', type=int, default=10, metavar='LENGTH', help='the maximum feature length')

    # Optimizer parameters
    parser.add_argument('--weight_decay', type=float, default=0.05, help='weight decay (default: 0.05)')
    parser.add_argument('--lr', type=float, default=None, metavar='LR', help='learning rate (absolute lr)')
    parser.add_argument('--blr', type=float, default=1e-3, metavar='LR', help='base learning rate: absolute_lr = base_lr * total_batch_size / 256')
    parser.add_argument('--min_lr', type=float, default=0., metavar='LR', help='lower lr bound for cyclic schedulers that hit 0')
    parser.add_argument('--warmup_epochs', type=int, default=40, metavar='N', help='epochs to warmup LR')

    # Dataset parameters
    parser.add_argument('--dataset', default='nextqa', type=str, help='dataset')
    parser.add_argument('--output_dir', default='./output_dir', help='path where to save, empty for no saving')
    parser.add_argument('--device', default='cuda', help='device to use for training / testing')
    parser.add_argument('--seed', default=0, type=int)
    parser.add_argument('--resume', default='', help='resume from checkpoint')
    parser.add_argument('--start_epoch', default=0, type=int, metavar='N', help='start epoch')
    parser.add_argument('--num_workers', default=2, type=int)
    parser.add_argument('--pin_mem', action='store_true', help='Pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.')
    parser.add_argument('--no_pin_mem', action='store_false', dest='pin_mem')
    parser.set_defaults(pin_mem=True)

    # distributed training parameters
    parser.add_argument('--world_size', default=1, type=int, help='number of distributed processes')
    parser.add_argument('--local_rank', default=-1, type=int)
    parser.add_argument('--dist_on_itp', action='store_true')
    parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
    
    parser.add_argument('--vaq', action='store_true', help='vaq loss')
    parser.add_argument('--qav', action='store_true', help='qav loss')
    parser.add_argument('--bias', type=float, default=3., help='attention bias')
    parser.add_argument('--tau', type=float, default=100., help='tau')
    parser.add_argument('--sub', action='store_true', help='subtitles for VLEP and TVQA')

    return parser

args = get_args_parser()
args = args.parse_args()

seed = args.seed
torch.manual_seed(seed)
np.random.seed(seed)
cudnn.benchmark = True
tokenizer = Tokenizer(model_path=f'{args.llama_model_path}tokenizer.model')

if args.dataset == 'nextqa':
    args.max_seq_len = 128
    args.batch_size = 8
    data_loader = load_data(args, tokenizer, split='val')
elif args.dataset == 'star':
    args.max_seq_len = 128
    args.batch_size = 8
    data_loader = load_data(args, tokenizer, split='val')
elif args.dataset == 'tvqa':
    args.max_seq_len = 650
    args.batch_size = 1
    data_loader = load_data(args, tokenizer, split='val')
elif args.dataset == 'vlep':
    args.max_seq_len = 256
    args.batch_size = 4
    data_loader = load_data(args, tokenizer, split='val')
elif args.dataset == 'dramaqa':
    args.max_seq_len = 384
    args.batch_size = 2
    data_loader = load_data(args, tokenizer, split='val')


device = torch.device(args.device)
model = LLaMA_VQA(args)
model = model.to(device)

checkpoint = torch.load(f'./pretrained/{args.dataset}.pth', map_location='cpu')
_ = model.load_state_dict(checkpoint['model'], strict=False)


from tqdm.auto import tqdm

model.eval()
accs = {}
step = 0
for data in tqdm(data_loader):
    answer = data['answer'].cuda()
    bsz = answer.shape[0]

    with torch.no_grad():
        logits = model(data, inference=True)
    
    count = (logits != 0).sum(-1)
    prediction = (logits.sum(-1) / count).argmin(-1)

    eval = (answer == prediction)

    for qid, eval_ in zip(data['qid'], eval.tolist()):
        accs[qid] = eval_

    step += 1
    if step % 1000 == 0:
        print(np.mean(list(accs.values())))

print(np.mean(list(accs.values())))

json.dump(accs, open(f'./results/{args.dataset}.json', 'w'))

and run as
python eval.py --dataset nextqa, etc. The batch_size and max_seq_len are set inside the code.

               NExT-QA   STAR    TVQA    VLEP    DramaQA
my acc          71.98    65.33   23.05   60.08    84.11
reported acc    72.0     65.4    82.2    71.0     84.1

As shown above, I cannot reach the reported numbers on two of the datasets (TVQA and VLEP). Could you check my evaluation code and help me understand why I cannot reproduce the numbers reported in the paper?

How to use a trained checkpoint to run inference on the validation set and resume from a checkpoint

I have trained a model and saved checkpoint_best.pth (for NExT-QA). Now, how do I use this checkpoint to run inference on the validation dataset? And how do I use it to resume the training process?
Folder structure

./pretrained
   └─ llama
       |─ 7B
       |   |─ consolidated.00.pth
       |   └─ params.json
       |─ 13B
       |   :
       |─ 33B
       |   :
       └─ tokenizer.model
./data
   |─ nextqa
   |   |─ train.csv
   |   |─ val.csv
   |   └─ clipvitl14.pth
   :
./checkpoint
   |─ nextqa
   |   |─ checkpoint_best.pth
   |   |─ log.txt
   :
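Based on the Evaluation section of the README, my guess is that inference works by taking the NExT-QA training command, swapping train.py for eval.py, and passing --resume, e.g.:

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 eval.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3.5 --tau 100. --max_feats 10 --dataset nextqa \
--blr 9e-2 --weight_decay 0.14 --output_dir ./checkpoint/nextqa --accum_iter 2 --vaq --qav \
--resume ./checkpoint/nextqa/checkpoint_best.pth

and that resuming training uses the same --resume flag with train.py, though I'm not sure whether the optimizer state and epoch counter are restored from checkpoint_best.pth. Is that correct?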
