
cognitive-visual-language-mapper's Introduction

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

🎁 Dataset

The evaluation dataset from the technical paper "A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering" is available on Hugging Face: Knowledge-intensive Dataset.

We have released the two-million-sample Wikipedia knowledge dataset as Wikipedia-Knowledge-2M. The dataset consists of a JSON file and a compressed archive containing all the image files; each image attribute in the JSON file corresponds to an image file in the archive.

We also provide the JSON file for the 504K KnowledgeQA dataset in LLaVA-KnowledgeQA-504K. The dataset mainly consists of the training sets of OK-VQA, A-OKVQA, and TextVQA. The images come from COCO Caption and TextVQA and need to be downloaded separately.
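
As a quick sanity check of the release format, the sketch below loads the knowledge JSON and verifies that every referenced image exists in the unpacked archive. The file names (knowledge_data.json, the images/ directory) and the field name "image" are assumptions for illustration; adjust them to whatever the downloaded files actually use.

import json
import os

json_path = "LLaVA/playground/knowledge_data/knowledge_data.json"  # hypothetical file name
image_root = "LLaVA/playground/knowledge_data/images"              # hypothetical unpacked archive

with open(json_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

# Each record is expected to reference its image file via an "image" attribute (assumed field name).
missing = [s["image"] for s in samples
           if not os.path.exists(os.path.join(image_root, s["image"]))]
print(f"{len(samples)} samples, {len(missing)} missing images")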

🔍 Environment

  • PyTorch 2.0.1
conda create -n CVLM python=3.8
conda activate CVLM
pip install -r requirement.txt

🐎 Train

Pretraining Visual Knowledge Aligner

Before starting the pretraining of the visual knowledge aligner, place the downloaded Wikipedia-Knowledge-2M dataset in the LLaVA/playground/knowledge_data directory.

Then run the following commands for pretraining.

cd LLaVA
export PYTHONPATH=path_to_current_dir
bash scripts/decoder_model/pretrain_knowledge.sh

1️⃣ LLaVA

Training Visual Knowledge Aligner with LLMs

Set the pretrain_opt_adapter attribute to the save path of your pretrained VKA.

bash scripts/knowledge/pretrain.sh

After this stage, use the provided code to extract the trainable parameters from the saved checkpoint files and store them; they serve as the input to the next training stage.
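
The extraction code itself ships with the repository; as a minimal sketch of the idea, assuming the checkpoint is an ordinary PyTorch state_dict and that the VKA parameters can be identified by a substring in their keys (the "vka" pattern and the file names below are assumptions for illustration):

import torch

ckpt = torch.load("checkpoints/vka-pretrain/pytorch_model.bin", map_location="cpu")  # hypothetical path

# Keep only the tensors belonging to the trainable aligner (key pattern is an assumption).
trainable = {k: v for k, v in ckpt.items() if "vka" in k}
torch.save(trainable, "checkpoints/vka-pretrain/trainable_params.bin")
print(f"kept {len(trainable)} of {len(ckpt)} tensors")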

Fine-tune VKA on the Question Answering task

Set the pretrain_knowledge_params_path attribute to the path where the parameters extracted in the previous stage are stored.

bash scripts/knowledge_qa/llava_vka_qa.sh

After completing this training stage, you can likewise use the code to extract both the trainable non-LoRA parameters and the LoRA parameters from the checkpoints.
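
Again only as a rough sketch (not the repository's exact script): LoRA weights can usually be recognized by "lora_" in their parameter names, so the two sets can be split by key. The non-LoRA key patterns and file paths below are assumptions.

import torch

ckpt = torch.load("checkpoints/vka-qa/pytorch_model.bin", map_location="cpu")  # hypothetical path

lora = {k: v for k, v in ckpt.items() if "lora_" in k}
# Non-LoRA trainable modules are assumed to be identifiable by key substrings such as "vka" or "mm_projector".
non_lora = {k: v for k, v in ckpt.items()
            if "lora_" not in k and ("vka" in k or "mm_projector" in k)}

torch.save(lora, "checkpoints/vka-qa/lora_params.bin")
torch.save(non_lora, "checkpoints/vka-qa/non_lora_params.bin")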

Fine-tune FKA on the Question Answering task.

Finally, we use a two-stage training scheme when fine-tuning the FKA.

bash scripts/knowledge_qa/llava_fka_qa.sh
bash scripts/knowledge_qa/llava_fka_qa_stage2.sh

Note that in each training stage, the parameters from the previous stage must be loaded via the pretrain_knowledge_params_path attribute, and those parameters should first be extracted with the code described above.
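
To make the chaining concrete, the two stages roughly fit together as follows; the checkpoint paths are placeholders and the exact flag spelling inside the scripts may differ.

# Stage 1: load the parameters extracted after VKA fine-tuning (hypothetical path).
bash scripts/knowledge_qa/llava_fka_qa.sh          # pretrain_knowledge_params_path -> checkpoints/vka-qa/non_lora_params.bin
# Extract parameters from the stage-1 checkpoint, then run stage 2 on top of them.
bash scripts/knowledge_qa/llava_fka_qa_stage2.sh   # pretrain_knowledge_params_path -> checkpoints/fka-stage1/non_lora_params.bin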

2️⃣ Qwen-VL

Training Visual Knowledge Aligner with LLMs

This stage also requires loading the parameters trained during Pretraining Visual Knowledge Aligner. Set the pretrain_opt_adapter attribute to your save path.

cd Qwen
bash finetune/pretrain_ds.sh

Fine-tune VKA on the Question Answering task

bash finetune/finetune_lora_ds.sh

🔶 Evaluation

The sam_images on GitHub are incomplete; you need to re-download them from Hugging Face.

We released the best LLaVA-based model as CVLM-LLaVA and the best Qwen-VL-based model as CVLM-Qwen.

After downloading checkpoints, organize the weights as follows.

└── LLaVA
    ├──checkpoints
        ├──CVLM-LLaVA
└── Qwen
    ├──checkpoints
        ├──CVLM-Qwen
            ├──qwen-pretrain
            ├──qwen-vka
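
If you download the checkpoints from Hugging Face, a layout like the one above can be reproduced with huggingface-cli; the namespace in the repo IDs below is a placeholder, so substitute the actual CVLM-LLaVA and CVLM-Qwen repository names.

huggingface-cli download <namespace>/CVLM-LLaVA --local-dir LLaVA/checkpoints/CVLM-LLaVA
huggingface-cli download <namespace>/CVLM-Qwen --local-dir Qwen/checkpoints/CVLM-Qwen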

1️⃣ LLaVA

The evaluation scripts for LLaVA are located in scripts/knowledge_qa/eval.

We mainly evaluated six benchmark datasets: OK-VQA, VQAv2, A-OKVQA, TextVQA, InfoSeek, and SEED-Bench.

Before evaluation, unzip the images generated by SAM.

cd LLaVA/playground/knowledge_qa/sam
tar -xzvf images_all.tar.gz

OK-VQA

Note that the saved result files will be placed in the answers_upload folder under the corresponding directory.

bash scripts/knowledge_qa/eval/okvqa.sh
cd playground/knowledge_qa/eval/okvqa
python okvqa_eval.py --pred_file your_save_path

VQAv2

bash scripts/knowledge_qa/eval/vqav2.sh
cd playground/knowledge_qa/eval/vqav2
python vqa_eval.py --pred_file your_save_path

A-OKVQA

Evaluation on open-ended A-OKVQA. The following script runs inference and also performs the scoring.

bash scripts/knowledge_qa/eval/aokvqa_oe.sh

Evaluation on multiple-choice A-OKVQA.

bash scripts/knowledge_qa/eval/aokvqa.sh

TextVQA

Evaluation on TextVQA.

bash scripts/knowledge_qa/eval/textvqa.sh

InfoSeek

Evaluation on InfoSeek.

bash scripts/knowledge_qa/eval/infoseek.sh

SEED-Bench

Evaluation on SEED-Bench.

bash scripts/knowledge_qa/eval/seedbench.sh

2️⃣ Qwen

The Qwen model is evaluated using the same datasets as the LLaVA model.

Before evaluating the Qwen-VL model, download the Qwen-VL model from Qwen-VL and replace the original files with the two Python files provided under the corresponding path in this repository.

OK-VQA

python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset okvqa --few-shot 0

VQAv2

python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset vqav2 --few-shot 0

A-OKVQA

Evaluation on open-ended A-OKVQA.

python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset aokvqa --few-shot 0

Evaluation on multiple-choice A-OKVQA.

python eval_mm/evaluate_multiple_choice_generated.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset aokvqa --few-shot 0

TextVQA

python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset textvqa --few-shot 0

InfoSeek

python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset infoseek --few-shot 0

SEED-Bench

python eval_mm/evaluate_multiple_choice_generated.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset seedbench --few-shot 0

Citation

If you find our paper and code useful in your research, please consider giving the repository a star and citing our papers:

@article{li2024cognitive,
  title={Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment},
  author={Li, Yunxin and Chen, Xinyu and Hu, Baotian and Shi, Haoyuan and Zhang, Min},
  journal={arXiv preprint arXiv:2402.13561},
  year={2024}
}
@article{li2023comprehensive,
  title={A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering},
  author={Li, Yunxin and Wang, Longyue and Hu, Baotian and Chen, Xinyu and Zhong, Wanqi and Lyu, Chenyang and Zhang, Min},
  journal={arXiv preprint arXiv:2311.07536},
  year={2023}
}

Acknowledgments

  • LLaVA: one of the codebases we built upon. Thanks for their wonderful work.

  • Qwen-VL: another codebase we built upon. Thanks for their wonderful work.

cognitive-visual-language-mapper's People

Contributors

xychen-hitsz, yunxinli


cognitive-visual-language-mapper's Issues

Code release

Thanks for your excellent work!!! When will the code be made public?

Question about the Evaluation section

Hi, I tried to reproduce the Evaluation part following the README.
Following the instructions, I organized the checkpoints and unzipped the images.

However, when I run the bash command, I get the error "huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data/share/Model/opt-1.3b'. Use repo_type argument if needed."

I am not sure which part /data/share/Model/opt-1.3b refers to. Thanks for your help.
