`EvalPlus(📖) => 📚`

Warning

🚨 Evaluating LLM-generated code on a dataset with just _3_ test-cases is NOT enough! 🚨

To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:

✨ improves programming benchmarks by patching up to thousands of new tests! EvalPlus(HumanEval) => HumanEval+ (81x new tests!)
✨ crafts a set of utility tools to sanitize, visualize and inspect LLM-generated code and evaluation results!
✨ accelerates LLM4Code research by open-sourcing LLM-generated samples for 14+ models -- no need to re-run the expensive benchmarks!

Read our paper for more detailed findings!

Use EvalPlus-enhanced dataset

To get started, please first set up the environment:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)

HumanEval+

from evalplus.data import get_human_eval_plus

fe = get_human_eval_plus() # -> a list of dictionaries (each is a programming problem)
# "task_id" is the identifier string for the task
# "entry_point": name of the function
# "prompt" is the function signature with docstring
# + "canonical_solution" is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
# + "base_input" holds the test inputs from the original HumanEval
# + "plus_input" holds the test inputs added by EvalPlus
# and others...
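
As a rough illustration of how these fields fit together, the sketch below runs a problem's canonical solution on its base and plus inputs to obtain expected outputs. The `toy_problem` record is hypothetical and hand-written for this example (it is not real HumanEval+ data). Two assumptions are ours rather than stated above: that "canonical_solution" is a function body continuing "prompt" (as in the original HumanEval), and that each input is a list of positional arguments.

```python
from typing import Any, Dict, List


def expected_outputs(problem: Dict[str, Any], input_key: str) -> List[Any]:
    """Run the canonical solution on every input stored under `input_key`."""
    namespace: Dict[str, Any] = {}
    # Assumption: prompt (signature + docstring) followed by the solution
    # body forms a complete function definition.
    exec(problem["prompt"] + problem["canonical_solution"], namespace)
    func = namespace[problem["entry_point"]]
    # Assumption: each input is a list of positional arguments.
    return [func(*args) for args in problem[input_key]]


# Hypothetical record that mimics the documented fields; not real HumanEval+ data.
toy_problem = {
    "task_id": "Toy/0",
    "entry_point": "add",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "base_input": [[1, 2], [0, 0]],
    "plus_input": [[-5, 5], [10**6, 1], [2.5, 0.5]],
}

base_expected = expected_outputs(toy_problem, "base_input")
plus_expected = expected_outputs(toy_problem, "plus_input")
print(base_expected, plus_expected)
```

Only run `exec` on trusted reference code like this; LLM-generated samples should be executed in a sandbox.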

MBPP+ (TBD)

Useful tools

Syntax checker for LLM-generated code

Check LLM-produced code and answer the following questions:

Is generation complete for all samples and all problems in the dataset?
Is the LLM-generated code compilable? (If not, something could be wrong and you should check.)

python tools/checker.py --folder /path/to/[model]-[??]b_temp_[??] --dataset humaneval
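
The compilability question can be approximated with Python's standard `ast` module. The sketch below is not the code of `tools/checker.py`; it only illustrates the idea of flagging samples that fail to parse. The function name and the sample strings are made up for illustration.

```python
import ast
from typing import Dict, List


def find_uncompilable(samples: Dict[str, str]) -> List[str]:
    """Return the names of samples whose source fails to parse as Python."""
    broken = []
    for name, source in samples.items():
        try:
            ast.parse(source)
        except SyntaxError:
            broken.append(name)
    return broken


# Hypothetical samples keyed by file name.
samples = {
    "0.py": "def add(a, b):\n    return a + b\n",
    "1.py": "def add(a, b):\n    return a +\n",  # truncated expression
    "2.py": "def add(a, b)\n    return a + b\n",  # missing colon
}
broken_samples = find_uncompilable(samples)
print(broken_samples)
```

Parsing only catches syntax errors; a sample that parses may still fail at runtime or produce wrong results.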

Post code sanitizer

LLM-generated code may contain syntax errors, but some of them can be easily fixed with simple post-processing. This tool makes LLM-generated code cleaner and more likely to compile by trimming it at additional "magical" EOF markers and removing garbage non-code tokens.

python tools/sanitize.py --eof --folder /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be written to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
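
The trimming idea can be sketched as cutting each generation at the earliest occurrence of a stop marker. This is an illustration only, not the logic of `tools/sanitize.py`; the marker list and function name below are assumptions chosen for the example.

```python
from typing import Sequence

# Hypothetical stop markers; the real tool may use a different set.
EOF_MARKERS = ("\nif __name__", "\nprint(", "\n# Example", "\nassert ")


def trim_at_eof(code: str, markers: Sequence[str] = EOF_MARKERS) -> str:
    """Cut `code` at the earliest marker, keeping everything before it."""
    cut = len(code)
    for marker in markers:
        idx = code.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Normalize trailing whitespace to a single newline.
    return code[:cut].rstrip() + "\n"


raw = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "# Example usage\n"
    "print(add(1, 2))\n"
    "This function adds two numbers.\n"
)
clean = trim_at_eof(raw)
print(clean)
```

Taking the earliest marker, rather than the first one in the list, keeps the result independent of marker order.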

Render `pass@k` results to `rich` and LaTeX tables

python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`
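
For context, `pass@k` is commonly computed with the unbiased estimator introduced alongside the original HumanEval benchmark: given n samples for a problem, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula. Whether EvalPlus uses exactly this routine is an assumption, and the function name is ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which makes the estimate 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of which pass all tests.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is then the mean of this per-problem estimate over all problems.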

Perform test input generation from scratch (TBD)

Development

Before you start:

pip install -r requirements.txt
pre-commit install
export PYTHONPATH=$PYTHONPATH:$(pwd)

Name convention

evalplus is the package name.
${DATASET}_plus is the name of a dataset enhanced with evalplus.

Citation

@article{evalplus,
  title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
  journal={arXiv preprint arXiv:2305.01210},
  year={2023},
}

Acknowledgement

HumanEval
