
agieval's Introduction

AGIEval

This repository contains information about AGIEval, data, code and output of baseline systems for the benchmark.

Introduction

AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. For a full description of the benchmark, please refer to our paper: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.

Tasks and Data

AGIEval v1.0 contains 20 tasks, including two cloze tasks (Gaokao-Math-Cloze and MATH) and 18 multi-choice question answering tasks (the rest). Among the multi-choice question answering tasks, Gaokao-physics and JEC-QA have one or more answers, and the other tasks only have one answer. You can find the full list of tasks in the table below.

You can download all post-processed data in the data/v1 folder. All usage of the data should follow the license of the original datasets. We provide the citation information of the original datasets in the Citation section below.

The data format for all datasets is as follows:

{
    "passage": null,
    "question": "设集合 $A=\\{x \\mid x \\geq 1\\}, B=\\{x \\mid-1<x<2\\}$, 则 $A \\cap B=$ ($\\quad$)\\\\\n",
    "options": ["(A)$\\{x \\mid x>-1\\}$", 
        "(B)$\\{x \\mid x \\geq 1\\}$", 
        "(C)$\\{x \\mid-1<x<1\\}$", 
        "(D)$\\{x \\mid 1 \\leq x<2\\}$"
        ],
    "label": "D",
    "answer": null
}

The passage field is available for gaokao-chinese, gaokao-english, both of logiqa, all of LSAT, and SAT. The answer for multi-choice tasks is saved in the label field. The answer for cloze tasks is saved in the answer field.
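As a minimal sketch of how a record in this format might be read, the snippet below loads a .jsonl file (one JSON object per line) and picks the gold answer from the label field for multi-choice tasks or from the answer field for cloze tasks. The helper names load_jsonl and gold_answer are illustrative and are not part of the repository's code.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def gold_answer(record):
    """Return `label` for multi-choice records, otherwise `answer` (cloze)."""
    if record.get("label") is not None:
        return record["label"]
    return record.get("answer")


# A tiny hand-written record in the same shape as the example above.
sample = json.loads(
    '{"passage": null, "question": "1 + 1 = ?", '
    '"options": ["(A)1", "(B)2"], "label": "B", "answer": null}'
)
cloze = {"passage": None, "question": "1 + 1 = ?", "options": None,
         "label": None, "answer": "2"}
```

With these definitions, gold_answer(sample) yields the option letter and gold_answer(cloze) yields the free-form answer string.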

We provide the prompts for few-shot learning in the data/v1/few_shot_prompts file.

Baseline Systems

We evaluate the performance of the baseline systems on AGIEval v1.0. The baseline systems are based on the following models: text-davinci-003, ChatGPT (gpt-3.5-turbo), and GPT-4. You can replicate the results by following the steps below:

  1. Fill in your OpenAI API key in the openai_api.py file.
  2. Run the run_prediction.py file to get the results.

Model Outputs

You can download the zero-shot, zero-shot-Chain-of-Thought, few-shot and few-shot-Chain-of-Thought outputs of the baseline systems for the first version of AGIEval from the OneDrive link.

Evaluation

You can run the post_process_and_evaluation.py file to get the evaluation results.

Leaderboard

Thanks to OpenCompass, which collects the results of multiple models on AGIEval, you can refer to its leaderboard for the latest results.

Citation

If you use the AGIEval dataset or the code in your research, please cite our paper:

@misc{zhong2023agieval,
      title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models}, 
      author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},
      year={2023},
      eprint={2304.06364},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Please make sure to cite all the individual datasets in your paper when you use them. We provide the relevant citation information below:

@inproceedings{ling-etal-2017-program,
    title = "Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems",
    author = "Ling, Wang  and
      Yogatama, Dani  and
      Dyer, Chris  and
      Blunsom, Phil",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P17-1015",
    doi = "10.18653/v1/P17-1015",
    pages = "158--167",
    abstract = "Solving algebraic word problems requires executing a series of arithmetic operations{---}a program{---}to obtain a final answer. However, since programs can be arbitrarily complicated, inducing them directly from question-answer pairs is a formidable challenge. To make this task more feasible, we solve these problems by generating answer rationales, sequences of natural language and human-readable mathematical expressions that derive the final answer through a series of small steps. Although rationales do not explicitly specify programs, they provide a scaffolding for their structure via intermediate milestones. To evaluate our approach, we have created a new 100,000-sample dataset of questions, answers and rationales. Experimental results show that indirect supervision of program learning via answer rationales is a promising strategy for inducing arithmetic programs.",
}

@inproceedings{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}

@inproceedings{Liu2020LogiQAAC,
  title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning},
  author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},
  booktitle={International Joint Conference on Artificial Intelligence},
  year={2020}
}

@inproceedings{zhong2019jec,
  title={JEC-QA: A Legal-Domain Question Answering Dataset},
  author={Zhong, Haoxi and Xiao, Chaojun and Tu, Cunchao and Zhang, Tianyang and Liu, Zhiyuan and Sun, Maosong},
  booktitle={Proceedings of AAAI},
  year={2020},
}

@article{Wang2021FromLT,
  title={From LSAT: The Progress and Challenges of Complex Reasoning},
  author={Siyuan Wang and Zhongkun Liu and Wanjun Zhong and Ming Zhou and Zhongyu Wei and Zhumin Chen and Nan Duan},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021},
  volume={30},
  pages={2201-2216}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.


agieval's Issues

Bug in Dataset Loader for Few-Shot Multiple Choice Questions

I've noticed that the current code uses the expression demo + question. However, I believe the correct expression should be demo + question_input. By using demo + question, the previously defined question_input is not being utilized and some multiple-choice questions may lack options in the prompt. Please consider updating the code to reflect this change for proper functionality. Thank you!

https://github.com/microsoft/AGIEval/blob/main/src/dataset_loader.py#L215

gaokao-english dirty data

The gaokao-english dataset contains a dirty record.

The question is

The engineer Camillo Oliver was 40 years old when he started the company in 1908. At his factory in Ivrea, he designed and produced the first Italian typewriter. Today the company's head office s still in Ivrea, near Turin, but the company is much larger than it was in those days and there are offices all around the world.By 1930 there was a staff of 700 and the company turned out 13,000 machines a year. Some went to customers in Italy, but Olivetti exported more typewriters to other countries.Camillo's son, Adriano, started working for the company in 1924 and later he became the boss. He introduced a standard speed for the production line and he employed technology and design specialists. The company developed new and better typewriters and then calculators(计算机). In 1959 it produced the ELEA computer system. This was the first mainframe(主机)computer designed and made in Italy.After Adriano died in 1960, the company had a period of financial problems. Other companies, especially the Japanese, made faster progress in electronic technology than the Italian company. In 1978, Carlo de Benedetti became the new boss. Olivetti increased its marking and service networks and made agreements with other companies to design and produce more advanced office equipment. Soon it became one of the world's leading companies in information technology and communications. There are now five independent companies in the Olivetti group—one for personal computers, one for Systems and services, and two for telecommunications.

The options are:

['(A)It produced the best typewriter in the world.     ', '(B)It designed the world’s firs![]()t mainframe computer.', '(C)It exported more typewriters than other companies.', '(D)It has five independent companies with its head office in Ivrea.']

Option B contains a stray string (`![]()`).

Details about the data collection

Thanks for your awesome work! I notice that Gaokao is an important part of your dataset, but most Gaokao papers are not freely available online. Could you please explain how you collected the Gaokao dataset? Thanks in advance :)

Unicode escape sequences in the json data

If you inspect aqua-rat.jsonl (and other datasets), there are unicode escape sequences throughout the data.

{"passage": null, "question": "A car is being driven, in a straight line and at a uniform speed, towards the base of a vertical tower. The top of the tower is observed from the car and, in the process, it takes 10 minutes for the angle of elevation to change from 45\u00b0 to 60\u00b0. After how much more time will this car reach the base of the tower?", "options": ["(A)5(\u221a3 + 1)", "(B)6(\u221a3 + \u221a2)", "(C)7(\u221a3 \u2013 1)", "(D)8(\u221a3 \u2013 2)", "(E)None of these"], 

This can be prevented by going back to the original script you used to write out the data, opening the output file with encoding='utf-8', and passing ensure_ascii=False to json.dumps, like so:

f.write(json.dumps(row, ensure_ascii=False) + '\n')
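The following sketch illustrates the difference, assuming a small hand-written row rather than a real dataset record. Both forms are valid JSON and decode to the same object, so the change only affects how readable the file is; write_jsonl is a hypothetical helper, not the script used to build the dataset.

```python
import json

row = {"question": "from 45° to 60°", "options": ["(A)5(√3 + 1)"]}

escaped = json.dumps(row)                       # default: ensure_ascii=True
readable = json.dumps(row, ensure_ascii=False)  # keeps ° and √ as written

# Both serialisations decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == row


def write_jsonl(rows, path):
    """Write rows as UTF-8 JSON lines without \\uXXXX escapes."""
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```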

Dirty data in the dataset.

Hi, when parsing the dataset's options, I found abnormal records whose number of options differs from the other records in the same subcategory.

  1. In gaokao-chemistry.jsonl, line 190's options include invalid entries (they are actually the question's analysis). The options list has 7 entries instead of 4: after option "D", there is a fifth option.

  2. Missing options.

  • In sat-en-without-passage.jsonl, line 17's options are missing option D, which should be "They may increase in value as those same resources become rare on Earth." (reference)

  • In sat-en-without-passage.jsonl, line 57's options are missing option D, which should be "No, because the data do not indicate whether the honeybees had been infected with mites.", while the label is "D". (reference)

  • In sat-en-without-passage.jsonl, line 98's options are missing option D, which should be "Published theories of scientists who developed earlier models of the Venus flytrap". You can refer to question 11 in the reference.

The same goes for sat-en.jsonl in lines 17, 57, and 98.

  3. In jec-qa-kd.jsonl, line 212's label is empty. The content is also dirty.
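Problems of this kind (extra options, missing options, empty labels) could be found automatically. The sketch below is one possible checker for multi-choice records; check_record and its expected_options parameter are hypothetical names, and the default of four options is an assumption that does not hold for every task.

```python
import re

OPTION_PREFIX = re.compile(r"^\(([A-Z])\)")


def check_record(record, expected_options=4):
    """Return a list of human-readable problems found in one record."""
    problems = []
    options = record.get("options") or []
    if len(options) != expected_options:
        problems.append(
            f"expected {expected_options} options, found {len(options)}"
        )
    letters = []
    for i, opt in enumerate(options):
        expected = chr(ord("A") + i)
        m = OPTION_PREFIX.match(opt)
        if m is None or m.group(1) != expected:
            problems.append(f"option {i} does not start with ({expected})")
        else:
            letters.append(m.group(1))
    label = record.get("label")
    if not label:
        problems.append("empty label")
    elif any(ch not in letters for ch in label):
        # Works for both "AD"-style strings and ["A", "D"]-style lists.
        problems.append(f"label {label!r} refers to a missing option")
    return problems
```

Running it over every line of a .jsonl file and printing the line number together with any non-empty result would surface records like the ones listed above.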

Several problems in logiqa-zh

There are several problems in logiqa-zh, e.g.

[ "A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术" ]

and it should be

[ "(A)没有党参", "(B)没有首乌", "(C)有白术", "(D)没有白术" ]
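A possible clean-up for this case is sketched below. normalize_option is a hypothetical helper, and it assumes that a single uppercase letter followed by whitespace at the start of an option is always an option marker, which is plausible for Chinese-language options but would mangle English options that begin with a word such as "A".

```python
import re

BARE_PREFIX = re.compile(r"^([A-Z])\s+")


def normalize_option(option):
    """Rewrite 'A text' as '(A)text'; leave '(A)text' untouched."""
    if option.startswith("("):
        return option
    return BARE_PREFIX.sub(r"(\1)", option, count=1)


options = ["A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术"]
fixed = [normalize_option(o) for o in options]
```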

There is a format error in the data, and an error may be reported when parsing json. In addition, it is strongly recommended to clean the data to provide users with higher quality evaluation data.

https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L75
{"passage": null, "question": "水溶液呈酸性的是( $)$", "options": ["(A)$\\mathrm{NaCl}$", "(B)$\\mathrm{NaHSO}_{4}$", "(C)HCOONa", "(D)$\mathrm{NaHCO}_{3}"], "label": "B", "answer": null, "other": {"source": "2020年浙江省高考化学【7月】"}}
Option D is missing a backslash \

Will human evaluation results be public?

I am interested in the human evaluation results, but there are only 4 pictures. Will the results (detailed or overall numeric results) be made public?

SAT-Math corpus includes incomplete data

The sat-math corpus contains incomplete questions that lack the information needed to solve them, for example:

{"passage": "", "question": "Which of the following is equivalent to the expression above?" ...

the few-shot-prompt format is different in gaokao-geography dataset

The few-shot prompts in the gaokao-geography dataset look like this:

{'passage': None, 'question': '在某城市中心,一种创新型绿色建筑一垂直森林高层住宅落成面世。它是在建筑的垂直方向上,覆盖满本地乔木、灌木和草本等植物,为每层住户营造“空中花园”,形成具有森林效应的生态居住群落。与传统设计相比,“垂直森林”在居住空间设计上变化最大的地方是( )', 'options': ['A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房'], 'label': 'A', 'answer': None, 'other': {'source': '2022年湖北省高考地理试题'}}

It should be

'options': ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']
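One way to repair such records is sketched below, assuming the packed form is always a single string whose parts are separated by tabs and prefixed with "A.", "B.", and so on, as in the example. split_packed_options is a hypothetical helper; it leaves any record it does not recognise untouched.

```python
import re


def split_packed_options(options):
    """Split a single tab-separated options string into '(A)...' entries."""
    if len(options) != 1:
        return options
    parts = [p.strip() for p in options[0].split("\t") if p.strip()]
    fixed = []
    for part in parts:
        m = re.match(r"^([A-Z])\.\s*(.*)$", part)
        if m is None:
            return options  # unexpected shape: leave the record untouched
        fixed.append(f"({m.group(1)}){m.group(2)}")
    return fixed


packed = ["A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房"]
```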

Error in gaokao-chemistry dataset

The options are wrong in this record:
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L108

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["錀   第七周期", "镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

It should be

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["(A)錀   第七周期", "(B)镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

Multiple choice in gaokao-mathqa dataset

There are about 7 multiple-answer questions in the gaokao-mathqa dataset, e.g.
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-mathqa.jsonl#L149

{"passage": null, "question": "函数 $f(x)=\\sin (2 x+\\varphi)(0<\\varphi<\\pi)$ 的图象以 $\\left(\\frac{2 \\pi}{3}, 0\\right)$ 中心对称, 则 ($\\quad$)\\\\\n", "options": ["(A)$y=f(x)$ 在 $\\left(0, \\frac{5 \\pi}{12}\\right)$ 单调递减", "(B)$y=f(x)$ 在 $\\left( -\\frac{\\pi}{12}, \\frac{11 \\pi}{12}\\right)$ 有 $2$ 个极值点", "(C)直线 $x= \\frac{7 \\pi}{6} $ 是一条对称轴", "(D)直线 $y= \\frac{\\sqrt{3}}{2} - x $ 是一条切线"], "label": "AD", "answer": null, "other": {"source": "2022年全国新高考II卷数学"}}

which doesn't match the format in gaokao-physics, i.e. ["A", "D"] .
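Until the two formats are unified, an evaluation script could normalise both before comparing. The sketch below uses a hypothetical helper, label_to_list, that accepts either a string such as "AD" or a list such as ["A", "D"]; sorting is an assumption made so that answer order does not affect the comparison.

```python
def label_to_list(label):
    """Return the label as a sorted list of single option letters."""
    if label is None:
        return []
    if isinstance(label, str):
        return sorted(ch for ch in label if ch.isalpha())
    return sorted(label)
```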

parse_result error in gaokao-physics-zero-shot

The model output is:
"model_output": "选项 (B) $3.3\mathrm{MeV}$。", "parse_result": ["B", "M", "V"], "label": "B", "is_correct": false;
Because gaokao-physics allows multiple answers, the parser collects every uppercase letter in the output, so a correct answer is marked as wrong.

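import re

def parse_qa_multiple_answer(string, setting_name):
    # extract_last_line is a helper from the repository (not shown here).
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # Matches ANY uppercase letter, optionally wrapped in parentheses.
    pattern = r"\(*([A-Z])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []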

Maybe we can restrict matches to a candidate answer list such as ["A", "B", "C", "D", "E", "F"] to reduce the chance of such errors?

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # Only letters A-F are accepted as answers.
    pattern = r"\(*([A-F])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []
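Restricting the character class to A-F fixes the "MeV" case, but a stray capital inside an English word (for example the "A" in "Answer") would still be picked up. A stricter variant is sketched below as one possible alternative, not as the repository's parser: it prefers parenthesised letters and only falls back to standalone letters when none are found. The function name parse_choice_letters and the candidates parameter are hypothetical.

```python
import re


def parse_choice_letters(text, candidates="ABCDEF"):
    """Extract answer letters, preferring parenthesised ones such as (B)."""
    cls = re.escape(candidates)
    # First choice: letters wrapped in ASCII or full-width parentheses.
    found = re.findall(rf"[(（]([{cls}])[)）]", text)
    if not found:
        # Fallback: a candidate letter not adjacent to another ASCII letter.
        found = re.findall(rf"(?<![A-Za-z])([{cls}])(?![A-Za-z])", text)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))
```

On the model output quoted in this issue, only the parenthesised "B" is returned, because "M" and "V" are neither candidates nor wrapped in parentheses.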

About API_dic

How do I get the custum_api_name? Why do I get the following errors?

multi-thread n = 3
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
