
LLoVi

This is the official implementation of the paper: A Simple LLM Framework for Long-Range Video Question-Answering.

Installation

Install environment.

Python 3.8 or above is required.

git clone git@github.com:CeeZh/LLoVi.git
cd LLoVi

python3 -m venv llovi_env
source llovi_env/bin/activate
pip install openai
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install pandas
pip install transformers
pip install accelerate
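
After installing, a quick sanity check (a minimal sketch, not part of the repo) confirms the CUDA build of PyTorch and the other packages above are importable:

```python
# Quick sanity check for the environment installed above.
import torch
import openai
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("openai:", openai.__version__)
print("transformers:", transformers.__version__)
```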

Download dataset annotations and extracted captions.

Download data.zip from https://drive.google.com/file/d/13M10CB5ePPVlycn754_ff3CwnpPtDfJA/view?usp=drive_link

unzip data.zip

We provide extracted captions for EgoSchema, NeXT-QA, NeXT-GQA and IntentQA in ./data. The archive also contains the dataset annotations.

We used the LaViLa base model to extract EgoSchema captions at 1 FPS, and LLaVA (llava-hf/llava-1.5-13b-hf) to extract captions for the other datasets at 0.5 FPS.

Note that LaViLa is trained on Ego4D, which overlaps with EgoSchema. To avoid data leakage, we trained LaViLa on videos that are not in EgoSchema. You can download the model from this link.
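
The released caption files are plain JSON, so you can inspect one before running anything. A minimal sketch (the path comes from the BLIP-2 command below; no particular schema is assumed):

```python
import json

# Peek at a released caption file to see its structure before running
# the pipeline; works whether the top level is a dict or a list.
with open("data/egoschema/blip2_fullset.json") as f:
    captions = json.load(f)

print(type(captions).__name__, "with", len(captions), "entries")
sample = next(iter(captions.items())) if isinstance(captions, dict) else captions[0]
print(str(sample)[:300])
```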

Download experiment results.

Download output.zip from https://drive.google.com/file/d/1d7a-FuQzdfQ7ZAzU5Y8HJpog1gm_sye_/view?usp=drive_link

unzip output.zip

The result files were generated by running the commands in the following sections. Note that existing result files are detected and reused, so those commands will not re-run as-is. Add --start_from_scratch to run a command again from scratch.
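
The downloaded result files can also be inspected directly. A minimal sketch, assuming the summary fields quoted in the Issues section below (num_total, num_valids, num_corrects, acc) are stored in the saved JSON:

```python
import json

# Print the summary statistics from a downloaded result file. The exact
# layout is an assumption based on the fields quoted in the Issues section.
with open("output/egoschema/standard_qa.json") as f:
    result = json.load(f)

for key in ("num_total", "num_valids", "num_corrects", "acc"):
    print(key, "=", result.get(key))
```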

EgoSchema

Captioner

# LaViLa
python main.py --output_base_path output/egoschema --output_filename standard_qa.json --api_key YOUR_OPENAI_KEY

# BLIP-2
python main.py --data_path data/egoschema/blip2_fullset.json --output_base_path output/egoschema --output_filename standard_qa_blip2.json --api_key YOUR_OPENAI_KEY

| Captioner | LLM | Prompt | Accuracy |
| --- | --- | --- | --- |
| LaViLa | gpt-3.5-turbo | standard | 51.2 |
| BLIP-2 | gpt-3.5-turbo | standard | 47.4 |

LLM

# gpt-3.5-turbo
python main.py --output_base_path output/egoschema --output_filename standard_qa.json --api_key YOUR_OPENAI_KEY

# gpt-3.5-turbo-1106
python main.py --model gpt-3.5-turbo-1106 --output_base_path output/egoschema --output_filename standard_qa_1106.json --api_key YOUR_OPENAI_KEY

# llama2 70B
python main.py --model meta-llama/Llama-2-70b-chat-hf --prompt_type qa_standard_llama --output_base_path output/egoschema --output_filename llama.json

# gpt-4 (run the gpt-3.5-turbo command first so its predictions can serve as a backup; otherwise remove --backup_pred_path)
python main.py --model gpt-4 --backup_pred_path output/egoschema/standard_qa.json --output_base_path output/egoschema --output_filename standard_qa_gpt4.json --api_key YOUR_OPENAI_KEY

# gpt-4-1106
python main.py --model gpt-4-1106-preview --output_base_path output/egoschema --output_filename standard_qa_gpt4_1106.json --api_key YOUR_OPENAI_KEY

| Captioner | LLM | Prompt | Accuracy |
| --- | --- | --- | --- |
| LaViLa | gpt-3.5-turbo | standard | 51.2 |
| LaViLa | gpt-3.5-turbo-1106 | standard | 55.2 |
| LaViLa | Llama-2-70B | standard | 55.4 |
| LaViLa | gpt-4 | standard | 59.0 |
| LaViLa | gpt-4-1106-preview | standard | 61.2 |
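
For reference, --backup_pred_path appears to supply fallback answers for questions where the primary model produced no valid prediction (GPT-4 occasionally declines to answer). A hedged sketch of that merging logic; the per-question schema and the validity test are assumptions, and the real behavior lives in main.py:

```python
import json

SUMMARY_KEYS = {"num_total", "num_valids", "num_corrects", "acc"}

# Merge a primary prediction file with a backup one: keep the primary
# answer when it looks valid, fall back to the backup otherwise.
primary = json.load(open("output/egoschema/standard_qa_gpt4.json"))
backup = json.load(open("output/egoschema/standard_qa.json"))

def is_valid(pred):
    # Assumption: a valid prediction is an option index for 5-way MCQ.
    return isinstance(pred, int) and 0 <= pred <= 4

merged = {
    qid: pred if is_valid(pred) else backup.get(qid)
    for qid, pred in primary.items()
    if qid not in SUMMARY_KEYS
}
```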

Prompting

# standard
## gpt-3.5-turbo
python main.py --output_base_path output/egoschema --output_filename standard_qa.json --api_key YOUR_OPENAI_KEY
## gpt-3.5-turbo-1106
python main.py --model gpt-3.5-turbo-1106 --output_base_path output/egoschema --output_filename standard_qa_1106.json --api_key YOUR_OPENAI_KEY

# zero-shot CoT
python main.py --prompt_type qa_zs-cot --output_base_path output/egoschema --output_filename cot.json --api_key YOUR_OPENAI_KEY

# (C, Q) -> S
## gpt-3.5-turbo
### Step 1. Generate a summary for each example.
python main.py --task sum --prompt_type sum_q --num_words_in_sum 500 --temperature 1.0 --output_base_path output/egoschema --output_filename sum_q_500.json --api_key YOUR_OPENAI_KEY
### Step 2. Feed the summary (instead of the raw captions) to the LLM.
python main.py --prompt_type qa_sum --data_path output/egoschema/sum_q_500_data.json --output_base_path output/egoschema --output_filename qa_sum_q_500.json --api_key YOUR_OPENAI_KEY
## gpt-3.5-turbo-1106
### Step 1. Generate a summary for each example.
python main.py --model gpt-3.5-turbo-1106 --task sum --prompt_type sum_q --num_words_in_sum 500 --temperature 1.0 --output_base_path output/egoschema --output_filename sum_q_500_1106.json --api_key YOUR_OPENAI_KEY
### Step 2. Feed the summary (instead of the raw captions) to the LLM.
python main.py --model gpt-3.5-turbo-1106 --prompt_type qa_sum --data_path output/egoschema/sum_q_500_1106_data.json --output_base_path output/egoschema --output_filename qa_sum_q_500_1106.json --api_key YOUR_OPENAI_KEY

| Captioner | LLM | Prompt | Accuracy |
| --- | --- | --- | --- |
| LaViLa | gpt-3.5-turbo | standard | 51.2 |
| LaViLa | gpt-3.5-turbo | zero-shot CoT | 55.2 |
| LaViLa | gpt-3.5-turbo | (C, Q) -> S | 57.4 |
| LaViLa | gpt-3.5-turbo-1106 | standard | 55.2 |
| LaViLa | gpt-3.5-turbo-1106 | (C, Q) -> S | 58.8 |
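
In miniature, the (C, Q) -> S pipeline first asks the LLM for a question-guided summary of the captions, then answers from the summary instead of the raw captions. A hedged sketch; the prompt wording is illustrative, not the exact template in main.py (note the summarization step samples at temperature 1.0, matching the commands above):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt, temperature=0.0):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

captions = "..."  # concatenated frame captions for one video
question = "..."  # the question plus its five answer options

# Step 1: question-guided summary, (C, Q) -> S, sampled at temperature 1.0.
summary = ask(
    f"Summarize the following video captions in about 500 words, keeping "
    f"details relevant to the question.\nQuestion: {question}\nCaptions: {captions}",
    temperature=1.0,
)
# Step 2: answer from the summary instead of the raw captions.
answer = ask(f"Summary: {summary}\nQuestion: {question}\nAnswer with one option.")
```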

Few shot

# standard
python main.py --fewshot_example_path data/egoschema/few_shot_6.json --backup_pred_path output/egoschema/standard_qa.json --prompt_type qa_standard_fewshot --output_base_path output/egoschema --output_filename fewshot.json --api_key YOUR_OPENAI_KEY

# (C, Q) -> S
### Step 1. Generate a summary for each example.
python main.py --task sum --prompt_type sum_q --num_words_in_sum 500 --temperature 1.0 --output_base_path output/egoschema --output_filename sum_q_500.json --api_key YOUR_OPENAI_KEY
### (Optional) Step 2. QA without few-shot examples, using the result as backup predictions. Otherwise, remove --backup_pred_path from the next command.
python main.py --prompt_type qa_sum --data_path output/egoschema/sum_q_500_data.json --output_base_path output/egoschema --output_filename qa_sum_q_500.json --api_key YOUR_OPENAI_KEY
### Step 3. QA with few-shot examples. 
python main.py --prompt_type qa_standard_fewshot --fewshot_example_path data/egoschema/few_shot_6.json --backup_pred_path output/egoschema/qa_sum_q_500.json --data_path output/egoschema/sum_q_500_data.json --output_base_path output/egoschema --output_filename fewshot_sum.json --api_key YOUR_OPENAI_KEY

| Captioner | LLM | Prompt | Accuracy |
| --- | --- | --- | --- |
| LaViLa | gpt-3.5-turbo | standard | 57.6 |
| LaViLa | gpt-3.5-turbo | (C, Q) -> S | 60.2 |
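
Few-shot prompting here amounts to prepending the solved exemplars from few_shot_6.json to the query. A hedged sketch of the prompt assembly; the exemplar keys below are hypothetical (inspect the file for the real ones), and the actual template lives in main.py:

```python
import json

# Assemble a few-shot prompt by prepending solved exemplars to the query.
# The 'captions'/'question'/'answer' keys are hypothetical placeholders.
examples = json.load(open("data/egoschema/few_shot_6.json"))

def build_fewshot_prompt(examples, captions, question):
    parts = []
    for ex in examples:
        parts.append(f"Captions: {ex['captions']}\n"
                     f"Question: {ex['question']}\n"
                     f"Answer: {ex['answer']}")
    parts.append(f"Captions: {captions}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)
```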

Accuracy on different categories

python eval.py \
--function eval_egoschema_cats \
--data_path output/egoschema/standard_qa.json \
--cats_path data/egoschema/categories.json

| Category | Percentage (%) | Accuracy |
| --- | --- | --- |
| Purpose/Goal Identification | 49.2 | 50.4 |
| Tools and Materials Usage | 21.8 | 55.0 |
| Key Action/Moment Detection | 21.6 | 43.5 |
| Action Sequence Analysis | 18.2 | 50.5 |
| Character Interaction | 9.4 | 63.8 |
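
Per-category accuracy is just standard accuracy restricted to each category (the percentages above sum to more than 100, suggesting a question can carry several category labels). A hedged sketch of the grouping; both file schemas here are assumptions, and the real logic is in eval.py:

```python
import json
from collections import defaultdict

# Group per-question correctness by category label and average. The
# schemas of both files are assumptions; eval.py holds the real logic.
preds = json.load(open("output/egoschema/standard_qa.json"))
cats = json.load(open("data/egoschema/categories.json"))

correct, total = defaultdict(int), defaultdict(int)
for qid, cat_list in cats.items():
    if qid not in preds:
        continue
    for cat in cat_list:
        total[cat] += 1
        correct[cat] += int(preds[qid].get("pred") == preds[qid].get("truth"))

for cat, n in sorted(total.items()):
    print(f"{cat}: {100 * correct[cat] / n:.1f}")
```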

NeXT-QA

python main.py \
--dataset nextqa \
--data_path data/nextqa/llava1.5_fps1.json \
--fps 0.5 \
--anno_path data/nextqa/val.csv \
--duration_path data/nextqa/durations.json \
--prompt_type qa_next \
--model gpt-4-1106-preview \
--output_base_path output/nextqa \
--output_filename gpt4_llava.json \
--api_key YOUR_OPENAI_KEY

Accuracy (Acc_C, Acc_T and Acc_D are the aggregate accuracies on causal (Why, How), temporal (Bef&Aft, When) and descriptive (Cnt, Loc, Other) questions):

| Why | How | Bef&Aft | When | Cnt | Loc | Other | Acc_C | Acc_T | Acc_D | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 70.95 | 65.45 | 54.69 | 70.14 | 58.76 | 82.71 | 78.36 | 69.51 | 61.04 | 75.55 | 67.71 |

IntentQA

python main.py \
--dataset intentqa \
--data_path data/nextqa/llava1.5_fps1.json \
--fps 0.5 \
--anno_path data/intentqa/test.csv \
--duration_path data/nextqa/durations.json \
--prompt_type qa_next \
--model gpt-4-1106-preview \
--output_base_path output/intentqa \
--output_filename gpt4_llava.json \
--api_key YOUR_OPENAI_KEY

Accuracy:

| Why | How | Bef&Aft | Total |
| --- | --- | --- | --- |
| 68.40 | 67.41 | 51.05 | 63.96 |

NeXT-GQA

# Step 1. QA
python main.py \
--dataset nextgqa \
--data_path data/nextqa/llava1.5_fps1.json \
--fps 0.5 \
--anno_path data/nextgqa/test.csv \
--duration_path data/nextqa/durations.json \
--prompt_type qa_next \
--model gpt-4-1106-preview \
--output_base_path output/nextgqa \
--output_filename gpt4_llava.json \
--api_key YOUR_OPENAI_KEY

# Step 2. Grounding
python main.py \
--dataset nextgqa \
--data_path data/nextqa/llava1.5_fps1.json \
--fps 0.5 \
--anno_path data/nextgqa/test.csv \
--duration_path data/nextqa/durations.json \
--nextgqa_gt_ground_path data/nextgqa/gsub_test.json \
--nextgqa_pred_qa_path output/nextgqa/gpt4_llava.json \
--prompt_type gqa \
--task gqa \
--model gpt-4-1106-preview \
--output_base_path output/nextgqa \
--output_filename gpt4_llava_grounding.json \
--save_info \
--api_key YOUR_OPENAI_KEY

| Acc&GQA | mIoP | IoP@0.3 | IoP@0.5 | mIoU | IoU@0.3 | IoU@0.5 |
| --- | --- | --- | --- | --- | --- | --- |
| 24.3 | 37.3 | 45.0 | 36.9 | 20.0 | 29.1 | 15.3 |
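
For reference, the grounding metrics follow the NExT-GQA benchmark: IoP measures the overlap between the predicted and ground-truth windows as a fraction of the predicted window, IoU as a fraction of their union; mIoP/mIoU are their means and IoP@t / IoU@t the fraction of questions exceeding threshold t. A worked sketch for a single interval pair:

```python
# Temporal grounding metrics for one (pred, gt) interval pair, each given
# as (start_sec, end_sec). IoP normalizes the overlap by the prediction's
# length; IoU normalizes by the union's length.
def iop_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    iop = inter / max(pred[1] - pred[0], 1e-9)
    iou = inter / max(union, 1e-9)
    return iop, iou

print(iop_iou((10.0, 20.0), (15.0, 30.0)))  # -> (0.5, 0.25)
```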

Debug

--save_info: save extra information, such as token usage and detailed prompts.
--num_examples_to_run: how many examples to run; -1 (default) runs all.
--start_from_scratch: ignore existing output files and start from scratch.

Citation

If you find this repository useful for your research, please consider citing our work:

@misc{zhang2023simple,
      title={A Simple LLM Framework for Long-Range Video Question-Answering}, 
      author={Ce Zhang and Taixi Lu and Md Mohaiminul Islam and Ziyang Wang and Shoubin Yu and Mohit Bansal and Gedas Bertasius},
      year={2023},
      eprint={2312.17235},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


llovi's Issues

Performance of LLoVi with 7B llama2

Dear author, thank you for your work! I would like to know the performance of LLoVi on NeXT-QA, NeXT-GQA and IntentQA when using 7B Llama 2 as the LLM.
Larger models like GPT-3.5 and GPT-4 are not open source, so we cannot do research on how to improve them on this task. I therefore think it would be beneficial for the community to report performance with smaller LLMs.

Please provide full narration hyper-parameters

Hi ;)

for comparability reasons, it would be beneficial for the community to have insights into the full hyper-parameter setups.
I am especially interested in the LaViLa captioning config to use with your provided fair checkpoint for EgoSchema.
In detail, I need the following information:

  1. You say you use nucleus sampling with top_p=0.95 and k=5 to generate 5 candidates. What temperature do you use?
  2. Also, the paper reports temperature=0.0 for the LLMs, but the README's example commands for the summarization task use temperature=1.0. Do you use temperature=1.0 for the LLM in the summarization task and temperature=0.0 in the QA task?

Clarification would be much appreciated! :)

Cheers,
Maximotus

Inconsistent Results on EgoSchema

Hi, thanks for your great work! I tried to reproduce your results on EgoSchema but found some inconsistencies. Specifically, I tried to reproduce the results with the standard prompt and the (C, Q) -> S prompt using the following commands:

standard prompt

python main.py --model gpt-3.5-turbo-1106 --output_base_path output/egoschema --output_filename standard_qa_1106.json

Results:
    "num_total": 500,
    "num_valids": 453,
    "num_corrects": 266,
    "acc": 0.532,

(C, Q) -> S prompt

python main.py --model gpt-3.5-turbo-1106 --task sum --prompt_type sum_q --num_words_in_sum 500 --temperature 1.0 --output_base_path output/egoschema --output_filename sum_q_500_1106.json

python main.py --model gpt-3.5-turbo-1106 --prompt_type qa_sum --data_path output/egoschema/sum_q_500_1106_data.json --output_base_path output/egoschema --output_filename qa_sum_q_500_1106.json

Results:
    "num_total": 500,
    "num_valids": 493,
    "num_corrects": 278,
    "acc": 0.556,

However, the results seem to differ from those reported in the README:

LaViLa	gpt-3.5-turbo-1106	standard	55.2
LaViLa	gpt-3.5-turbo-1106	(C, Q) -> S	58.8

I have not modified any code and used the captions you released. Are there any possible reasons for the inconsistency? I also noticed that the results in the README are slightly different from those in the paper. Could you please tell me the reason behind this? Thank you!

Best regards

Inquiry about resources and processing time

Hi,
I am currently working with the NExT-QA dataset and ran your code using the model meta-llama/Meta-Llama-3-8B, as GPT-3.5 and GPT-4 are not open source. Could you please provide details on the resources you used to achieve the results on the NExT-QA dataset? Additionally, how long did it take to process 1000 annotations?
