open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 30+ benchmarks.

Home Page: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

License: Apache License 2.0

Languages: Python 97.14%, Shell 0.07%, Jupyter Notebook 2.79%
Topics: gpt-4v, large-language-models, llava, multi-modal, openai, vqa, llm, openai-api, qwen, gpt

vlmevalkit's Introduction

VLMEvalKit (the Python package name is vlmeval) is an open-source evaluation toolkit for large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation across multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.

🆕 News

  • [2024-08-20] We optimized the evaluation pipeline of MMMB and Multilingual MMBench; you can now use the names MMMB and MTL_MMBench_DEV to obtain the results for all six languages at once (see the example command after this list)
  • [2024-08-19] We have supported Llama-3-MixSenseV1_1, thanks to Zero-Vision 🔥🔥🔥
  • [2024-08-12] We have supported MMMB and Multilingual MMBench, thanks to Hai-Long Sun🔥🔥🔥
  • [2024-08-09] We have supported Hunyuan-Vision, evaluation results coming soon🔥🔥🔥
  • [2024-08-08] We created a HuggingFace Dataset: OpenVLMRecords to keep all our evaluation records. You can find sample-level predictions of all evaluated benchmarks there🔥🔥🔥
  • [2024-08-08] We have supported MiniCPM-V 2.6, thanks to lihytotoro🔥🔥🔥
  • [2024-08-07] We have supported two new multi-image understanding benchmarks: DUDE and SlideVQA, thanks to mayubo2333🔥🔥🔥
  • [2024-08-06] We have supported TaskMeAnything ImageQA-Random Dataset, thanks to weikaih04🔥🔥🔥
  • [2024-08-05] We have supported a new evaluation strategy for AI2D, which does not mask the corresponding areas when choices are uppercase letters; instead, each area is annotated with a rectangular contour. Set the dataset name to AI2D_TEST_NO_MASK to evaluate under this setting (the leaderboard currently still uses the previous setting)
  • [2024-08-05] We have supported Mantis, thanks to BrenchCC🔥🔥🔥
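
To make the 2024-08-20 item above concrete, the all-in-one multilingual names can be passed to run.py like any other dataset name. The command below is only an illustrative sketch; the model name is a placeholder for any supported VLM:

python run.py --data MMMB MTL_MMBench_DEV --model llava_v1.5_7b --verbose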

📊 Datasets, Models, and Evaluation Results

The performance numbers on our official multi-modal leaderboards can be downloaded from here!

OpenVLM Leaderboard: Download All DETAILED Results.

Supported Image Understanding Dataset

  • By default, all evaluation results are presented on the OpenVLM Leaderboard.
  • Abbrs: MCQ: Multi-choice question; Y/N: Yes-or-No Questions; MTT: Benchmark with Multi-turn Conversations; MTI: Benchmark with Multi-Image as Inputs.
Dataset | Dataset Names (for run.py) | Task
MMBench Series (MMBench, MMBench-CN, CCBench) | MMBench_DEV_[EN/CN], MMBench_TEST_[EN/CN], MMBench_DEV_[EN/CN]_V11, MMBench_TEST_[EN/CN]_V11, CCBench | MCQ
MMStar | MMStar | MCQ
MME | MME | Y/N
SEEDBench Series | SEEDBench_IMG, SEEDBench2, SEEDBench2_Plus | MCQ
MM-Vet | MMVet | VQA
MMMU | MMMU_[DEV_VAL/TEST] | MCQ
MathVista | MathVista_MINI | VQA
ScienceQA_IMG | ScienceQA_[VAL/TEST] | MCQ
COCO Caption | COCO_VAL | Caption
HallusionBench | HallusionBench | Y/N
OCRVQA* | OCRVQA_[TESTCORE/TEST] | VQA
TextVQA* | TextVQA_VAL | VQA
ChartQA* | ChartQA_TEST | VQA
AI2D | AI2D_[TEST/TEST_NO_MASK] | MCQ
LLaVABench | LLaVABench | VQA
DocVQA+ | DocVQA_[VAL/TEST] | VQA
InfoVQA+ | InfoVQA_[VAL/TEST] | VQA
OCRBench | OCRBench | VQA
RealWorldQA | RealWorldQA | MCQ
POPE | POPE | Y/N
Core-MM- | CORE_MM (MTI) | VQA
MMT-Bench | MMT-Bench_[VAL/ALL], MMT-Bench_[VAL/ALL]_MI | MCQ (MTI)
MLLMGuard- | MLLMGuard_DS | VQA
AesBench+ | AesBench_[VAL/TEST] | MCQ
VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA
MMLongBench-Doc+ | MMLongBench_DOC | VQA (MTI)
BLINK | BLINK | MCQ (MTI)
MathVision+ | MathVision, MathVision_MINI | VQA
MT-VQA+ | MTVQA_TEST | VQA
MMDU+ | MMDU | VQA (MTT, MTI)
Q-Bench1+ | Q-Bench1_[VAL/TEST] | MCQ
A-Bench+ | A-Bench_[VAL/TEST] | MCQ
DUDE+ | DUDE | VQA (MTI)
SlideVQA+ | SLIDEVQA, SLIDEVQA_MINI | VQA (MTI)
TaskMeAnything ImageQA Random+ | TaskMeAnything_v1_imageqa_random | MCQ
MMMB and Multilingual MMBench+ | MMMB_[ar/cn/en/pt/ru/tr], MMBench_dev_[ar/cn/en/pt/ru/tr], MMMB, MTL_MMBench_DEV (MMMB and MTL_MMBench_DEV are all-in-one names covering all six languages) | MCQ

* We only provide a subset of the evaluation results, since some VLMs do not yield reasonable results under the zero-shot setting

+ The evaluation results are not available yet

- Only inference is supported in VLMEvalKit

VLMEvalKit uses a judge LLM to extract the answer from the model output if you set the corresponding API key; otherwise it falls back to exact-matching mode (looking for "Yes", "No", "A", "B", "C", ... in the output string). Exact matching can only be applied to the Yes-or-No tasks and the multiple-choice tasks.
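
To make the exact-matching fallback concrete, here is a minimal, simplified sketch of the idea; it is not the actual VLMEvalKit implementation, which handles many more edge cases:

# Simplified illustration of exact matching (not VLMEvalKit code).
def exact_match(prediction, options=('A', 'B', 'C', 'D')):
    """Return the first expected option found in the model output, or None."""
    for opt in options:          # e.g. ('Yes', 'No') for Y/N benchmarks
        if opt in prediction:
            return opt
    return None                  # unmatched outputs would go to the judge LLM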

Supported Video Understanding Dataset

Dataset | Dataset Names (for run.py) | Task
MMBench-Video | MMBench-Video | VQA
Video-MME | Video-MME | MCQ

Supported API Models

GPT-4v (20231106, 20240409) 🎞️🚅, GPT-4o 🎞️🚅, Gemini-1.0-Pro 🎞️🚅, Gemini-1.5-Pro 🎞️🚅, Step-1V 🎞️🚅, Reka-[Edge / Flash / Core] 🚅, Qwen-VL-[Plus / Max] 🎞️🚅, Claude3-[Haiku / Sonnet / Opus] 🎞️🚅, GLM-4v 🚅, CongRong 🎞️🚅, Claude3.5-Sonnet 🎞️🚅, GPT-4o-Mini 🎞️🚅, Yi-Vision 🎞️🚅, Hunyuan-Vision 🎞️🚅

Supported PyTorch / HF Models

IDEFICS-[9B/80B/v2-8B]-Instruct🎞️🚅, InstructBLIP-[7B/13B], LLaVA-[v1-7B/v1.5-7B/v1.5-13B], MiniGPT-4-[v1-7B/v1-13B/v2-7B], mPLUG-Owl2🎞️, OpenFlamingo-v2🎞️, PandaGPT-13B, Qwen-VL🎞️🚅, Qwen-VL-Chat🎞️🚅,
VisualGLM-6B🚅, InternLM-XComposer-[1/2]🚅, ShareGPT4V-[7B/13B]🚅, TransCore-M, LLaVA (XTuner)🚅, CogVLM-[Chat/Llama3]🚅, ShareCaptioner🚅, CogVLM-Grounding-Generalist🚅, Monkey🚅, Monkey-Chat🚅, EMU2-Chat🚅🎞️, Yi-VL-[6B/34B], MMAlaya🚅,
InternLM-XComposer-2.5🚅🎞️, MiniCPM-[V1/V2/V2.5/V2.6]🚅🎞️, OmniLMM-12B, InternVL-Chat-[V1-1/V1-2/V1-5/V2]🚅🎞️, Mini-InternVL-Chat-[2B/4B]-V1-5🚅🎞️, DeepSeek-VL🎞️, LLaVA-NeXT🚅🎞️, Bunny-Llama3🚅, XVERSE-V-13B,
PaliGemma-3B🚅, 360VL-70B🚅, Phi-3-Vision🚅, WeMM🚅, GLM-4v-9B🚅, Cambrian-[8B/13B/34B], LLaVA-Next-[Qwen-32B]🎞️, Chameleon-[7B/30B]🚅🎞️,
Video-LLaVA-7B-[HF]🎬, VILA1.5-[3B/8B/13B/40B]🎞️, Ovis1.5-[Llama3-8B/Gemma2-9B]🚅🎞️, Mantis-8B-[siglip-llama3/clip-llama3/Idefics2/Fuyu]🎞️, Llama-3-MixSenseV1_1🚅, Parrot-7B🚅, OmChat-v2.0-13B-sinlge-beta🚅

🎞️: Support multiple images as inputs.

🚅: Models can be used without any additional configuration/operation.

🎬: Support Video as inputs.

Transformers Version Recommendation:

Note that some VLMs may not be able to run under certain transformers versions; we recommend the following settings to evaluate each VLM:

  • Please use transformers==4.33.0 for: Qwen series, Monkey series, InternLM-XComposer Series, mPLUG-Owl2, OpenFlamingo v2, IDEFICS series, VisualGLM, MMAlaya, ShareCaptioner, MiniGPT-4 series, InstructBLIP series, PandaGPT, VXVERSE, GLM-4v-9B.
  • Please use transformers==4.37.0 for: LLaVA series, ShareGPT4V series, TransCore-M, LLaVA (XTuner), CogVLM Series, EMU2 Series, Yi-VL Series, MiniCPM-[V1/V2], OmniLMM-12B, DeepSeek-VL series, InternVL series, Cambrian Series, VILA Series, Llama-3-MixSenseV1_1, Parrot-7B.
  • Please use transformers==4.40.0 for: IDEFICS2, Bunny-Llama3, MiniCPM-Llama3-V2.5, 360VL-70B, Phi-3-Vision, WeMM.
  • Please use transformers==latest for: LLaVA-Next series, PaliGemma-3B, Chameleon series, Video-LLaVA-7B-HF, Ovis series, Mantis series, MiniCPM-V2.6, OmChat-v2.0-13B-sinlge-beta.
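
As a small convenience, you can assert the installed transformers version before launching an evaluation. The snippet below is only an illustrative sketch; adjust the expected prefix to the model group above:

import transformers

# Hypothetical guard: pick the prefix ('4.33', '4.37', '4.40', ...) matching the model group above.
expected_prefix = '4.37'
assert transformers.__version__.startswith(expected_prefix), (
    f'transformers {transformers.__version__} is installed, '
    f'but this model family is recommended with {expected_prefix}.x'
)
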
# Demo
from vlmeval.config import supported_VLM
model = supported_VLM['idefics_9b_instruct']()
# Forward Single Image
ret = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(ret)  # The image features a red apple with a leaf on it.
# Forward Multiple Images
ret = model.generate(['assets/apple.jpg', 'assets/apple.jpg', 'How many apples are there in the provided images? '])
print(ret)  # There are two apples in the provided images.

🏗️ QuickStart

See [QuickStart | 快速开始] for a quick start guide.
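
A typical command-line invocation (mirroring the commands shown in the issues further down; dataset and model names come from the tables above) looks like:

python run.py --data MMBench_DEV_EN --model llava_v1.5_7b --verbose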

🛠️ Development Guide

To develop custom benchmarks, VLMs, or simply contribute other codes to VLMEvalKit, please refer to [Development_Guide | 开发指南].

Call for contributions

To encourage contributions from the community and share the corresponding credit (in the next report update):

  • All Contributions will be acknowledged in the report.
  • Contributors with 3 or more major contributions (implementing an MLLM, a benchmark, or a major feature) can join the author list of the VLMEvalKit Technical Report on arXiv. Eligible contributors can create an issue or DM kennyutc in the VLMEvalKit Discord channel.

🎯 The Goal of VLMEvalKit

The codebase is designed to:

  1. Provide an easy-to-use, open-source evaluation toolkit that makes it convenient for researchers and developers to evaluate existing LVLMs and makes evaluation results easy to reproduce.
  2. Make it easy for VLM developers to evaluate their own models. To evaluate a VLM on multiple supported benchmarks, one only needs to implement a single generate_inner() function; all other workloads (data downloading, data preprocessing, prediction inference, metric calculation) are handled by the codebase (see the sketch after this list).
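
The following is only a rough, hypothetical sketch of what such a wrapper could look like; the actual base class, message format, and registration mechanism are documented in the Development Guide:

# Hypothetical example; the class name, loader, and message handling are illustrative only.
class MyVLM:
    def __init__(self):
        self.model = load_my_model()  # hypothetical: load your own checkpoint here

    def generate_inner(self, message, dataset=None):
        # 'message' is assumed to interleave image paths and text, as in the demo below;
        # the exact structure expected by VLMEvalKit is defined in the Development Guide.
        images = [m for m in message if str(m).endswith(('.jpg', '.png'))]
        prompt = ' '.join(m for m in message if not str(m).endswith(('.jpg', '.png')))
        return self.model.answer(images, prompt)  # hypothetical inference call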

The codebase is not designed to:

  1. Reproduce the exact accuracy numbers reported in the original papers of all third-party benchmarks. The reason is two-fold:
    1. VLMEvalKit uses generation-based evaluation for all VLMs (optionally with LLM-based answer extraction), while some benchmarks use different approaches (e.g., SEEDBench uses PPL-based evaluation). For those benchmarks, we compare both scores in the corresponding results. We encourage developers to support other evaluation paradigms in the codebase.
    2. By default, we use the same prompt template for all VLMs when evaluating a benchmark, while some VLMs may have their own specific prompt templates (some of which may not be covered by the codebase yet). We encourage VLM developers to implement their own prompt templates in VLMEvalKit if they are not covered currently; that will help improve reproducibility.

🖊️ Citation

If you find this work helpful, please consider starring 🌟 this repo. Thanks for your support!


If you use VLMEvalKit in your research or wish to refer to published open-source evaluation results, please use the following BibTeX entry, along with the BibTeX entries for the specific VLMs / benchmarks you used.

@misc{duan2024vlmevalkit,
      title={VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models},
      author={Haodong Duan and Junming Yang and Yuxuan Qiao and Xinyu Fang and Lin Chen and Yuan Liu and Xiaoyi Dong and Yuhang Zang and Pan Zhang and Jiaqi Wang and Dahua Lin and Kai Chen},
      year={2024},
      eprint={2407.11691},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.11691},
}


vlmevalkit's People

Contributors

amitbcp, brenchcc, cuiunbo, czczup, dseidli, dylanqyuan, eltociear, ezra-yu, fangxinyu-0913, fitzpchao, isaachhh, iyuge2, junming-yang, kennymckormick, lightdxy, lzhgrla, mary-0830, mayubo2333, pciresearch, runninglsy, sheryc, shuozhang2003, sparksjoe, starcycle, sun-hailong, tianyu-z, tousenkaname, weikaih04, yuanliuuuuuu, zeyofu


vlmevalkit's Issues

Cannot reproduce llava v1.5 7b SEEDBench_IMG results

When I use the default settings to run llava v1.5 7b evaluation on the SEEDBench_IMG dataset, I get results like this:
[screenshot]

I checked the intermediate results, and the model seems to generate the options correctly:
[screenshot]

The default generation config should be

  • do_sample=True
  • temperature=0.2
  • max_new_tokens=512
  • top_p=None
  • num_beams=1

And the officially reported results should be:
[screenshot]

It's really weird. I don't know why there is such a huge gap here. I hope to get some help. Thank you in advance!

IndexError: index 1 is out of bounds for dimension 0 with size 1

cur_image_features = image_features[cur_image_idx]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1

Is there any reason why the previous version could evaluate normally, but after updating from git I get this error?

This error only happens during evaluation, at about 6% progress.

Are there samples with zero images in the MMBench eval set?

Question about MathVista testing

What is the difference between the Prefetch rate and Acc for MathVista-mini? In my test the Prefetch rate is 52.8, but the Acc is only 44.1.

[Question] Support for other test sets

Hi,
thanks for your team's work.
I would like to ask:

  1. Will inference and evaluation on test sets such as GQA, OKVQA, and CMMMU be supported in the future?
  2. Will calling APIs for multi-modal evaluation be supported in the future, as in opencompass?
  3. The scores on chartqa and textvqa do not match the numbers in the official papers; will this be improved later?

ModuleNotFoundError: No module named 'xtuner.parallel'

I met this problem when testing with the command:
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr 10.255.244.33 --master_port 8109 run.py --data LLaVABench --model llava-internlm2-20b --verbose

Traceback (most recent call last):
File "/code/src/VLMEvalKit/run.py", line 153, in
main()
File "/code/src/VLMEvalKit/run.py", line 83, in main
model = infer_data_job(
File "/code/src/VLMEvalKit/vlmeval/inference.py", line 210, in infer_data_job
model = infer_data(
File "/code/src/VLMEvalKit/vlmeval/inference.py", line 142, in infer_data
response = model.generate(prompt=struct['text'], image_path=struct['image'], dataset=dataset_name)
File "/code/src/VLMEvalKit/vlmeval/vlm/llava_xtuner.py", line 177, in generate
from xtuner.model.utils import prepare_inputs_labels_for_multimodal
File "/usr/local/lib/python3.10/dist-packages/xtuner/model/init.py", line 3, in
from .sft import SupervisedFinetune
File "/usr/local/lib/python3.10/dist-packages/xtuner/model/sft.py", line 16, in
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
ModuleNotFoundError: No module named 'xtuner.parallel'
[2024-04-03 03:58:31,981] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2207729) of binary: /bin/python

CUDA out of memory when evaluating llava 34B

Hi, when evaluating llava 34B I have 8 GPUs available, but I found that only one GPU is used for inference, which leads to CUDA out of memory. Does llava 34B evaluation support multi-GPU evaluation? Thanks.

If I want to use a LLaVA model in VLMEval, which version should I install?

The error below appears when I install LLaVA v1.1.3 after having installed the latest VLMEval:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.                                                                            
jupyterlab 4.1.2 requires httpx>=0.25.0, but you have httpx 0.24.0 which is incompatible.                                    
xtuner 0.1.13 requires transformers!=4.34.1,!=4.35.0,!=4.35.1,!=4.35.2,>=4.32.1, but you have transformers 4.31.0 which is incompatible.                                                                                                                  
vlmeval 0.1.0 requires gradio==4.15.0, but you have gradio 3.35.2 which is incompatible.                                     
vlmeval 0.1.0 requires transformers==4.33.0, but you have transformers 4.31.0 which is incompatible. 

How is the final HallusionBench result calculated?

The final HallusionBench results include aAcc, fAcc, and qAcc. How is the final accuracy calculated? Is it the average of the three? Also, does the MME evaluation only include the perception part? The leaderboard seems to include both perception and cognition.

[Feature Request] To evaluate MMMU test set, you need to transfer the xlsx output to a json file

Hello,

When using VLMEvalKit with MMMU_TEST, you will generate an xlsx output file, e.g.,

[screenshot]

This format cannot be accepted by the online MMMU EvalAI server. The server requires this json format.

The following code converts the xlsx file to the required json format:

import pandas as pd
import json

# Read the xlsx file
def read_xlsx(file_path):
    # Use pandas to read the xlsx file
    df = pd.read_excel(file_path, engine='openpyxl')
    return df

# Convert to a single-dict json format
def convert_to_single_json(df):
    # Select the 1st and 23rd columns
    selected_columns = df.iloc[:, [0, 22]]

    # Create an empty dict to store the result
    result_dict = {}

    # Iterate over the rows
    for index, row in selected_columns.iterrows():
        # Use the value in the 1st column as the key and the value in the 23rd column as the value
        result_dict[row.iloc[0]] = row.iloc[1]

    # Convert the dict to a json-formatted string
    json_data = json.dumps(result_dict, indent=4)

    return json_data

# Main function
def main():
    # Path to the xlsx file
    file_path = 'hpt-air-mmmu_MMMU_TEST.xlsx'  # replace with the path to your xlsx file

    # Read the xlsx file
    df = read_xlsx(file_path)

    # Convert to a single-dict json format
    json_data = convert_to_single_json(df)

    # Print the json data
    print(json_data)

    # Save the json data to a file
    with open('hpt-air-mmmu_MMMU_TEST.json', 'w') as f:
        f.write(json_data)

if __name__ == '__main__':
    main()

Would you like to add it into VLMEval?

Best,
StarCycle

llava_v1.5_7b wrong results on Seedbench_IMG

Hi,

I checked the saved results of llava-7b on the SEEDBench_IMG benchmark and found that, for some questions, llava-7b gives the right prediction, but the evaluation framework marks it as a wrong answer.

For example, for index = 1198 and 4307:

[screenshot for index 1198]
Which object is likely found in the boy's hand? A: A book B: A soccer ball C: A calculator D: A pencil

[screenshot for index 4307]
Where is the priest located in the image? A: In front of the stained glass window B: To the right of the bride and groom C: To the left of the bride and groom D: Behind the bride and groom

Can you help explain this? Thanks for your help!

OSError: Incorrect path_or_model_id: 'xtuner/llava-internlm2-20b/projector'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr 10.255.xxx.xxx --master_port 8109 run.py --data LLaVABench --model llava-internlm2-20b --verbose

But I met the following problem:
Traceback (most recent call last):
File "/train-xxx/code/xxx/src/test_scripts/test.py", line 6, in
projector = AutoModel.from_pretrained(projector_path,
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
resolved_config_file = cached_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 462, in cached_file
raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'xtuner/llava-internlm2-20b/projector'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

I also used the following script to test, and got the same error on my side.

import os.path as osp
import torch
from transformers import AutoModel

projector_path="xtuner/llava-internlm2-20b/projector"
projector = AutoModel.from_pretrained(projector_path,
                                              trust_remote_code=True,
                                              torch_dtype=torch.float16,
                                              device_map='cpu')

[Question] Experiment scores don't match: llava_v1.5_7b scores on MMMU_DEV_VAL differ from the official ones

Environment:

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda-11.7',
 'GCC': 'gcc (GCC) 8.4.1 20200928 (Anolis 8.4.1-1.0.1)',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-PCIE-40GB',
 'MMEngine': '0.10.3',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 11.7, V11.7.99',
 'OpenCV': '4.9.0',
 'PyTorch': '1.13.1+cu117',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201402\n'
                              '  - Intel(R) Math Kernel Library Version '
                              '2020.0.0 Product Build 20191122 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.6.0 (Git Hash '
                              '52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.7\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n'
                              '  - CuDNN 8.5\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.7, '
                              'CUDNN_VERSION=8.5.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-DEDGE_PROFILER_USE_KINETO -O2 -fPIC '
                              '-Wno-narrowing -Wall -Wextra '
                              '-Werror=return-type -Werror=non-virtual-dtor '
                              '-Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, '
                              'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',
 'TorchVision': '0.14.1+cu117',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.2+4bd2256',
 'sys.platform': 'linux'}

Command: python run.py --data MMMU_DEV_VAL --model llava_v1.5_7b --verbose

Output:
[screenshot]

The results saved in the llava_v1.5_7b_MMMU_DEV_VAL_acc.csv file are shown below.

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.006666666666666667","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.2","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.02857142857142857"
"validation","0.014444444444444444","0.0","0.0","0.0","0.0","0.0","0.03333333333333333","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.03333333333333333","0.06666666666666667","0.0","0.13333333333333333","0.1","0.06666666666666667","0.0","0.0","0.0","0.0","0.03333333333333333","0.016666666666666666","0.02666666666666667","0.009523809523809525"

Why is there such a large gap between the results obtained by running inference and evaluation with this command and the official numbers?

Detailed results of ScienceQA-IMG

Thanks for the great effort on this repo! I see you provide the zero-shot results of several MLLMs on the ScienceQA-IMG dataset. Could you please add the detailed results (i.e., NAT, SOC, LAN) for the TEST and VAL partitions?

ChartQA augmented & CMMMU

It would be nice if VLMEvalKit could support evaluation on the ChartQA augmented set and CMMMU, since it already supports the ChartQA human set and MMMU.

How to calculate the average rank?

[screenshot]
On the leaderboard, [LLaVA-InternLM2-20B (QLoRA)] gets a higher average score than Monkey-Chat, but Monkey-Chat ranks higher. How is the Avg. Rank shown on the leaderboard calculated?

A major problem with the multiple-choice evaluation

There is a major problem with the multiple-choice evaluation.
I am testing MMBench-dev-en here. I use the result file generated by the llava framework (llava_MMBench_DEV_EN.xlsx), and the result of your test here is 0.68.

Because its predictions are all a single word (just the option letter), I tried to match them myself, simply checking if item['prediction'] == item['answer'], and found that the final result is 0.77. So either your test standard is seriously wrong, or I missed something; please let me know.

If you want the result file to test, I can send it to you, or you can just have a check.

[screenshot]

Unknown error when loading LLaVA model

When I run the command
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model llava_v1.5_13b --verbose
it shows
warnings.warn('Unknown error when loading LLaVA model.')
[screenshot]

How should I deal with this?

MMMU test set

It would be nice if VLMEvalKit could generate results for the MMMU test set.

There is a large gap between the validation accuracy measured by vlmevalkit and the model papers

On the TextVQA dataset, the InstructBLIP 13B paper reports an accuracy of 50.7, and the Qwen-VL-Chat paper reports 63.75.
In the accuracy measured officially by vlmevalkit, InstructBLIP 13B is about 30 and Qwen-VL-Chat is 10.5; what do you think is the problem?
Also, I tested InstructBLIP 13B on TextVQA myself and got an accuracy of 16.7; what went wrong? These are all prefetch results, without using GPT.

Evaluation of custom models and datasets.

VLMEvalKit is a pretty convenient evaluation tool for MLLMs. I hope the authors can extend VLMEvalKit with a framework that supports the evaluation of custom models and custom datasets, defining a unified MLLM input-output interface and a conversion format for datasets.

Questions regarding the metrics for SEED bench

Hi,

Thanks for putting up the benchmark and releasing the eval tool. I'm running some experiments on both MMBench and the SEED bench, where I'm having some confusion regarding the metrics in the SEED leaderboard and would appreciate any inputs.

[screenshot]

Specifically, I have three questions.

  1. What does "heuristic matching" mean in ExactMatchRate?
  2. I'm not fully understanding the definition of MatchedAcc and ExactMatchAcc (and the difference between them). Would you mind explaining it with a concrete example?
  3. It is mentioned, for the official SEED leaderboard, that For models with limited instruction following capabilities (including qwen_base, MiniGPT-4, InstructBLIP, flamingov2), the performance gap between generation-based evaluation and PPL-based evaluation is significant. I understand what PPL-based evaluation means (ranking options by perplexity), but what does generation-based evaluation mean here?

Thank you in advance for your help.

Numpy compilation issue during installation

Hello. I created a new conda environment to install this project with python 3.10. According to the error message below it seems like numpy was unable to find the right blas library.

c/umath -Inumpy/core/src/npysort -I/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -c'
gcc: numpy/core/src/multiarray/alloc.c
gcc: numpy/core/src/multiarray/buffer.c
gcc: numpy/core/src/multiarray/common.c
gcc: numpy/core/src/multiarray/array_assign_scalar.c
gcc: numpy/core/src/multiarray/descriptor.c
gcc: numpy/core/src/multiarray/conversion_utils.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/einsum.c
gcc: numpy/core/src/multiarray/datetime_strings.c
gcc: numpy/core/src/multiarray/arrayobject.c
gcc: numpy/core/src/multiarray/array_assign_array.c
gcc: numpy/core/src/multiarray/ctors.c
gcc: numpy/core/src/multiarray/convert.c
gcc: numpy/core/src/multiarray/calculation.c
gcc: numpy/core/src/multiarray/datetime_busday.c
gcc: numpy/core/src/multiarray/arrayfunction_override.c
gcc: numpy/core/src/multiarray/convert_datatype.c
gcc: numpy/core/src/multiarray/hashdescr.c
gcc: numpy/core/src/multiarray/datetime_busdaycal.c
gcc: numpy/core/src/multiarray/item_selection.c
gcc: numpy/core/src/multiarray/compiled_base.c
gcc: numpy/core/src/multiarray/dragon4.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/arraytypes.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/lowlevel_strided_loops.c
gcc: numpy/core/src/multiarray/multiarraymodule.c
gcc: numpy/core/src/multiarray/datetime.c
gcc: numpy/core/src/multiarray/dtype_transfer.c
gcc: numpy/core/src/multiarray/nditer_constr.c
gcc: numpy/core/src/multiarray/iterators.c
gcc: numpy/core/src/multiarray/refcount.c
gcc: numpy/core/src/multiarray/scalarapi.c
gcc: numpy/core/src/multiarray/nditer_pywrap.c
gcc: numpy/core/src/multiarray/sequence.c
gcc: numpy/core/src/multiarray/shape.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.c
numpy/core/src/multiarray/scalartypes.c.src: In function ‘float_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2967:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2967:12: error: too few arguments to function ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘cfloat_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2975:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2976 | PyArrayScalar_VAL(obj, C@name@).real);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2975:16: error: too few arguments to function ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2982 | PyArrayScalar_VAL(obj, C@name@).imag);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:16: error: too few arguments to function ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘longdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2967:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2967:12: error: too few arguments to function ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘clongdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2975:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2976 | PyArrayScalar_VAL(obj, C@name@).real);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2975:16: error: too few arguments to function ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2982 | PyArrayScalar_VAL(obj, C@name@).imag);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:16: error: too few arguments to function ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘half_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2997:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2997 | return _Py_HashDouble(npy_half_to_double(PyArrayScalar_VAL(obj, Half)));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| double
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2997:12: error: too few arguments to function ‘_Py_HashDouble’
2997 | return _Py_HashDouble(npy_half_to_double(PyArrayScalar_VAL(obj, Half)));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘longdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2968:1: warning: control reaches end of non-void function [-Wreturn-type]
2968 | }
| ^
numpy/core/src/multiarray/scalartypes.c.src: In function ‘float_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2968:1: warning: control reaches end of non-void function [-Wreturn-type]
2968 | }
| ^
numpy/core/src/multiarray/scalartypes.c.src: In function ‘half_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2998:1: warning: control reaches end of non-void function [-Wreturn-type]
2998 | }
| ^
gcc: numpy/core/src/multiarray/temp_elide.c
gcc: numpy/core/src/multiarray/vdot.c
gcc: numpy/core/src/umath/umathmodule.c
gcc: numpy/core/src/multiarray/typeinfo.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/loops.c
gcc: numpy/core/src/multiarray/usertypes.c
gcc: numpy/core/src/multiarray/number.c
gcc: numpy/core/src/umath/reduction.c
gcc: numpy/core/src/umath/ufunc_object.c
gcc: numpy/core/src/umath/ufunc_type_resolution.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/nditer_templ.c
gcc: numpy/core/src/multiarray/flagsobject.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/npymath/ieee754.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/npymath/npy_math_complex.c
gcc: numpy/core/src/multiarray/getset.c
gcc: numpy/core/src/umath/override.c
gcc: numpy/core/src/npymath/halffloat.c
gcc: numpy/core/src/multiarray/nditer_api.c
gcc: numpy/core/src/common/array_assign.c
gcc: numpy/core/src/common/ucsnarrow.c
gcc: numpy/core/src/npymath/npy_math.c
gcc: numpy/core/src/common/mem_overlap.c
gcc: numpy/core/src/common/ufunc_override.c
gcc: numpy/core/src/common/numpyos.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/common/npy_cpu_features.c
gcc: numpy/core/src/common/npy_longdouble.c
gcc: numpy/core/src/umath/extobj.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/scalarmath.c
gcc: numpy/core/src/multiarray/mapping.c
gcc: numpy/core/src/multiarray/methods.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/matmul.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/clip.c
error: Command "gcc -pthread -B /home/ubuntu/mambaforge-pypy3/envs/vlme/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/mambaforge-pypy3/envs/vlme/include -fPIC -O2 -isystem /home/ubuntu/mambaforge-pypy3/envs/vlme/include -fPIC -DNPY_INTERNAL_BUILD=1 -DHAVE_NPY_CONFIG_H=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/umath -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Inumpy/core/include -Ibuild/src.linux-x86_64-3.10/numpy/core/include/numpy -Inumpy/core/src/common -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/src/npysort -I/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -c build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.c -o build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o -MMD -MF build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o.d" failed with exit status 1
[end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for numpy
  Failed to build numpy
  ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
  [end of output]

(feature request) can we add load_dotenv() as a small quality of life improvement?

Hi OpenCompass VLMEvalKit team,

Thank you for your hard work on this project! I have a very minor feature request: can we add load_dotenv so that users can run without explicitly setting their OPENAI_API_KEY environment variable in the terminal before each run?

This way, a user can add their key to the variable once in a .env file and it will automatically be loaded.
Happy to open a pull request if helpful.
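
For reference, a minimal sketch of the requested behavior, assuming the python-dotenv package is installed:

from dotenv import load_dotenv
import os

load_dotenv()  # reads OPENAI_API_KEY (and other variables) from a local .env file, if present
api_key = os.getenv('OPENAI_API_KEY')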

Error Encountered in Multi-Node Evaluation Using Distributed Arguments

I encountered an issue while attempting to perform a multi-node evaluation using PyTorch's torchrun with specific distributed arguments. Below is the command I used, including the distributed arguments setup and the execution command:



DISTRIBUTED_ARGS=" \
    --nproc_per_node 3 \
    --nnodes 4 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS run.py \
    --data MME MMBench_DEV_EN MMBench_DEV_CN CCBench SEEDBench_IMG MMMU_DEV_VAL MathVista_MINI HallusionBench LLaVABench \
    MMBench_TEST_EN MMBench_TEST_CN \
    --model llava

Upon execution, I received the following error message:

RUN - ERROR - No such file or directory: './llava/312_MME.pkl'
It seems like only the .pkl files on node 0 were saved correctly; only 012, 112, and 212 were saved.
Thank you in advance for your assistance!

Enhancing Multi-Choice Question Handling with Case-Sensitive Matching

It might be beneficial to implement a mechanism for exact matching of uppercase option letters in multi-choice questions. This could help avoid confusion caused by the presence of quantifiers like "a" in responses, which might be mistakenly interpreted as indicating multiple choices. Additionally, in cases where multiple letters or multiple instances of "yes"/"no" appear, the system could prioritize the analysis of the first word in the sentence to determine the intended response.
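
For illustration only, a rough sketch of the proposed first-word heuristic (not VLMEvalKit code) might look like:

def first_word_choice(prediction, letters=('A', 'B', 'C', 'D')):
    # Take the first whitespace-separated token and strip trailing punctuation.
    tokens = prediction.strip().split()
    first = tokens[0].strip('.,:;)') if tokens else ''
    return first if first in letters else None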

I am also curious about whether the scores currently displayed on the OpenCompass leaderboard have been updated to reflect these latest modifications. Could you provide any information on this?

TypeError in parallel API calling

[screenshot]

I met this error when calling the OpenAI API in parallel (nproc=4). I am not sure if this is an error on the VLMEvalKit developers' side... Could someone give me some tips for fixing it?

Where is vlmeval/utils/data_util.py?

[screenshot]
Excellent work :)
But I can't find the data_util file in the latest version.
I would also like to ask when the InternLM2 7B model will be supported. Thanks!
