flagopen / flageval

FlagEval is an evaluation toolkit for AI large foundation models.

License: Apache License 2.0

Shell 0.33% Python 99.05% Dockerfile 0.62%

flageval's Issues

Are all the evaluation results on the leaderboard `zero-shot`?

Where to find the shot configuration for each benchmark? @xuanricheng

Are those evaluation metrics all `accuracy`?

The code is a bit hard to understand:

@xuanricheng

Question about `_merge_local_settings` in cli.py of flageval-serving

Where does the `base` variable in this function come from?

Is there an arXiv paper or report for reference?

Running evaluate.py raises KeyError: 'text_config'. How can I fix it?

I ran the command from evaluate.md:
python evaluate.py --datasets=cifar10,cifar100 --model_name=AltCLIP-XLMR-L

The dataset and model are downloaded, but there's an error:

File "C:\anaconda3\lib\site-packages\flagai\model\mm\AltCLIP.py", line 83, in __init__
self.text_config = STUDENT_CONFIG_DICT[kwargs['text_config']['model_type']]
KeyError: 'text_config'

According to the source code of class AltCLIPConfig, text_config should be passed in through **kwargs, but nothing is actually passed. How can I supply it?

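Is the only difference between deploying locally and applying through the BAAI official website the compute that is used (local versus BAAI)? Are there no differences in the evaluation method or results?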

How could resnet50 be 1B parameters?

As we know, ResNet50 has a total of 25,636,712 parameters. Of these, 25,583,592 are trainable and 53,120 are non-trainable. The model has 177 layers.
Check this link for further explanation.

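Question about the evaluation of the Tongyi Qianwen (Qwen) chat version on the latest leaderboard

I don't know the details of this evaluation, but for most models the chat version improves over the base version on Chinese objective questions, whereas Qwen drops off a cliff (0.596 -> 0.070), which looks anomalous. The result also differs substantially from the OpenCompass evaluation.
Could you double-check the evaluation details? Perhaps there is a problem with the prompt or something similar.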

Any plans to include video benchmarks?

Such as video comprehension, generation, ...

A problem in finder.py

from cached_property import cached_property

Is this wrong? Shouldn't it be:
from functools import cached_property
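
For context, `functools.cached_property` was added to the standard library in Python 3.8, while `cached_property` is a separate third-party package that offers the same decorator on older interpreters. A minimal sketch of an import that works with either; the `Finder` class and its `paths` attribute are hypothetical, for illustration only, and are not taken from finder.py:

```python
try:
    # Standard library, available on Python 3.8+.
    from functools import cached_property
except ImportError:
    # Third-party backport (pip install cached-property) for older Pythons.
    from cached_property import cached_property


class Finder:
    """Hypothetical class for illustration; not the one in finder.py."""

    def __init__(self):
        self.scans = 0

    @cached_property
    def paths(self):
        # Runs once per instance; the result is then stored on the instance.
        self.scans += 1
        return ["a.py", "b.py"]


finder = Finder()
finder.paths
finder.paths
print(finder.scans)  # counts how many times the body actually ran
```

So the third-party import is not necessarily a bug; it may simply be there to support Python versions older than 3.8.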

What are `LLSRC`, `SLSRC`, `SLPWC`, and `SLPWC` benchmarks?

https://arxiv.org/abs/2112.15093
https://aclanthology.org/2021.emnlp-main.306/
Are these the ones?

How to access the leaderboards besides NLP?

Why are there discrepancies between the documentation claim and actual benchmarks?

https://flageval.baai.ac.cn/#/taskIntro?t=en_qa

The main function in cli.py of flageval-serving

The Python main function is written as:

def main(): cli()

Shouldn't it be the following instead?
if __name__ == "__main__": cli()
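
Both forms are valid Python and serve different purposes. A `main()` function can be named as a console-script entry point in packaging metadata, while the `if __name__ == "__main__":` guard only runs when the file is executed directly. A minimal sketch combining the two; the `cli` function here is a hypothetical stand-in, not the one from flageval-serving, and whether that package registers `main` as an entry point is an assumption:

```python
import sys


def cli(argv=None):
    """Hypothetical stand-in for the real command-line handler."""
    args = sys.argv[1:] if argv is None else argv
    return "cli called with {}".format(args)


def main():
    # Importable wrapper: packaging tools can point a console script at
    # "module:main" and call it without the __main__ guard ever running.
    return cli()


if __name__ == "__main__":
    # Only reached when the file is run directly, e.g. `python cli.py`.
    print(main())
```

Keeping `main()` and adding the guard underneath it supports both the installed command and direct execution.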

Why include empty evaluation results?

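Question about the leaderboard

Are the decimals on the leaderboard scores?

ChatGLM-6B scores 0.212 on the Chinese_MMLU dataset (Chinese multiple-choice QA). Does that mean 21.2 points out of a full score of 100, that is, only 21 of 100 questions answered correctly?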
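
If the leaderboard value is plain accuracy (an assumption; the issue does not confirm which metric is used), the conversion asked about above is simple arithmetic. A minimal sketch, with `correct_out_of` as a hypothetical helper name:

```python
def correct_out_of(score, total_questions):
    """Number of correct answers implied by an accuracy score in [0, 1]."""
    return round(score * total_questions)


score = 0.212  # ChatGLM-6B on Chinese_MMLU, as quoted in the question
print("{:.1f} out of 100 points".format(score * 100))
print(correct_out_of(score, 100), "of 100 questions correct")
```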

Why is the audio leaderboard greyed out?

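Question about flageval-serving

If I want to evaluate a natural language model locally and offline, can I do that with the flageval-serving module?

Or does offline testing require uploading the model and code to the FlagEval platform? https://flageval.baai.ac.cn/#/rule?m=2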

Where are the evaluation tasks?

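What basic preparation does FlagEval require?

Hello, I have installed FlagEval, and the installation seems fine. How do I use it now? Should I start by trying models such as AltCLIP? https://github.com/FlagOpen/FlagEval/blob/master/imageEval/README.md Judging from this README, is this evaluation tool designed specifically for multimodal models such as AltCLIP, and not applicable to Aquila? Is my understanding off?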

flageval-serving returns output with Unicode escapes. Is this format OK?

I wrote a test using ChatGLM2, ran the server, and gave it the input "你是谁" ("Who are you?"). The response I got contains a bunch of Unicode escape sequences.
Is that OK for your evaluation?

The output is:

{
  "completions": [
    {
      "logprobs": [],
      "text": "\u4f60\u662f\u8c01?\n\n\u6211\u662f ChatGLM,\u662f\u6e05\u534e\u5927\u5b66KEG\u5b9e\u9a8c\u5ba4\u548c\u667a\u8c31AI\u516c\u53f8\u5171\u540c\u8bad\u7ec3\u7684\u8bed\u8a00\u6a21\u578b\u3002\u6211\u7684\u4efb\u52a1\u662f\u670d\u52a1\u5e76\u5e2e\u52a9\u4eba\u7c7b,\u4f46\u6211\u5e76\u4e0d\u662f\u4e00\u4e2a\u771f\u5b9e\u7684\u4eba\u3002",
      "tokens": "\u4f60\u662f\u8c01?\n\n\u6211\u662f ChatGLM,\u662f\u6e05\u534e\u5927\u5b66KEG\u5b9e\u9a8c\u5ba4\u548c\u667a\u8c31AI\u516c\u53f8\u5171\u540c\u8bad\u7ec3\u7684\u8bed\u8a00\u6a21\u578b\u3002\u6211\u7684\u4efb\u52a1\u662f\u670d\u52a1\u5e76\u5e2e\u52a9\u4eba\u7c7b,\u4f46\u6211\u5e76\u4e0d\u662f\u4e00\u4e2a\u771f\u5b9e\u7684\u4eba\u3002",
      "top_logprobs_dicts": []
    }
  ],
  "input_length": 0,
  "model_info": "",
  "status": 200
}
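
The `\uXXXX` sequences are standard JSON escapes for non-ASCII characters, so any JSON parser restores the original text. Whether the FlagEval platform accepts this response is for the maintainers to confirm, but the payload itself is lossless. A minimal sketch using Python's `json` module on a shortened copy of the `text` field:

```python
import json

# Shortened copy of the escaped "text" field from the response above.
raw = '{"text": "\\u4f60\\u662f\\u8c01?"}'

# Parsing turns the escapes back into the original Chinese characters.
decoded = json.loads(raw)
print(decoded["text"])

# json.dumps escapes non-ASCII by default; ensure_ascii=False keeps it readable.
print(json.dumps(decoded))
print(json.dumps(decoded, ensure_ascii=False))
```

If readable output is preferred on the server side, serializing with `ensure_ascii=False` is one option, provided the response is sent as UTF-8.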

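Questions about the implementation details of the evaluation

In your Zhihu article you wrote: "We use ImageEval-prompt to evaluate well-known text-to-image models. For each prompt, each model generates 8 images; annotators rank the 8 images without seeing the prompt and select the three highest-ranked images, and finally annotate whether these three images correctly express the key information of the prompt."

In the last step, "annotate whether these three images correctly express the key information of the prompt", what exactly is done?

For example, take the prompt "穿着华丽的衣服的女士坐在椅子上，素描" ("a lady in gorgeous clothes sitting on a chair, sketch"), whose color, gender, and facial-feature labels are 0, 1, and 2 respectively. Do evaluators only judge, based on the labeled dimensions and ignoring the prompt, whether the generated image matches each dimension's label (0 = not present, 1 = simple assessment, 2 = complex assessment)? Or can evaluators see both the labels and the prompt, and then go back to the prompt to judge whether the image expresses it accurately? For example, knowing that the gender label is 1, the evaluator would have to judge from the prompt whether the generated image matches the "lady" described there. If evaluators use the second approach, what is the difference between label 1 (simple assessment) and label 2 (complex assessment) in the evaluation workflow?

Finally, in the section "标注维度说明" (annotation dimension description), you give specific labels for each sub-dimension. Does every item in the dataset have specific labels for every sub-dimension for evaluators to consult? Or is the currently open-sourced dataset already everything there is?

Thank you for your answer!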

Internal Server Error after loading the CV leaderboards

This also applies to multimodal leaderboards.
