
๐Ÿซ ๐Ÿฎ ๐Ÿ“š InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Paper | Model | Leaderboard

📣 Introducing Resta: Safety Re-alignment of Language Models. Paper | GitHub

📣 Red-Eval, the benchmark for safety evaluation of LLMs, has been added: Red-Eval

📣 Introducing Red-Eval to evaluate the safety of LLMs using several jailbreaking prompts. With Red-Eval, GPT-4 can be jailbroken/red-teamed with a 65.1% attack success rate, and ChatGPT 73% of the time, as measured on the DangerousQA and HarmfulQA benchmarks. More details are here: Code and Paper.

📣 We developed Flacuna by fine-tuning Vicuna-13B on the Flan collection. Flacuna is better than Vicuna at problem-solving. Access the model here: https://huggingface.co/declare-lab/flacuna-13b-v1.0

📣 The InstructEval benchmark and leaderboard have been released.

📣 The paper reporting the performance of instruction-tuned LLMs on the InstructEval benchmark suite has been released on arXiv. Read it here: https://arxiv.org/pdf/2306.04757.pdf

📣 We are releasing IMPACT, a dataset for evaluating the writing capability of LLMs in four aspects: Informative, Professional, Argumentative, and Creative. Download it from Hugging Face: https://huggingface.co/datasets/declare-lab/InstructEvalImpact

📣 FLAN-T5 is also useful for text-to-audio generation. If you are interested, find our work at https://github.com/declare-lab/tango.

This repository contains code to evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. We aim to facilitate simple and convenient benchmarking across multiple tasks and models.

Why?

Instruction-tuned models such as Flan-T5 and Alpaca represent an exciting direction to approximate the performance of large language models (LLMs) like ChatGPT at lower cost. However, it is challenging to compare the performance of different models qualitatively. To evaluate how well the models generalize across a wide range of unseen and challenging tasks, we can use academic benchmarks such as MMLU and BBH. Compared to existing libraries such as evaluation-harness and HELM, this repo enables simple and convenient evaluation of multiple models. Notably, we support most models from HuggingFace Transformers 🤗 (check here for the list of models we support).
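
Since the supported models are standard Hugging Face checkpoints, you can also query any of them directly with plain transformers code, independent of this repo. A minimal sketch using flan-t5-xl (the prompt and generation settings here are only illustrative):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a supported checkpoint straight from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Ask a single question and decode the greedy output.
inputs = tokenizer("Q: What is the capital of France?\nA:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))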

Results

For detailed results, please see our leaderboard.

| Model Name | Model Path | Paper | Size | MMLU | BBH | DROP | HumanEval |
|---|---|---|---|---|---|---|---|
| GPT-4 | | Link | ? | 86.4 | | 80.9 | 67.0 |
| ChatGPT | | Link | ? | 70.0 | | 64.1 | 48.1 |
| seq_to_seq | google/flan-t5-xxl | Link | 11B | 54.5 | 43.9 | | |
| seq_to_seq | google/flan-t5-xl | Link | 3B | 49.2 | 40.2 | 56.3 | |
| llama | eachadea/vicuna-13b | Link | 13B | 49.7 | 37.1 | 32.9 | 15.2 |
| llama | decapoda-research/llama-13b-hf | Link | 13B | 46.2 | 37.1 | 35.3 | 13.4 |
| seq_to_seq | declare-lab/flan-alpaca-gpt4-xl | Link | 3B | 45.6 | 34.8 | | |
| llama | TheBloke/koala-13B-HF | Link | 13B | 44.6 | 34.6 | 28.3 | 11.0 |
| llama | chavinlo/alpaca-native | Link | 7B | 41.6 | 33.3 | 26.3 | 10.3 |
| llama | TheBloke/wizardLM-7B-HF | Link | 7B | 36.4 | 32.9 | | 15.2 |
| chatglm | THUDM/chatglm-6b | Link | 6B | 36.1 | 31.3 | 44.2 | 3.1 |
| llama | decapoda-research/llama-7b-hf | Link | 7B | 35.2 | 30.9 | 27.6 | 10.3 |
| llama | wombat-7b-gpt4-delta | Link | 7B | 33.0 | 32.4 | | 7.9 |
| seq_to_seq | bigscience/mt0-xl | Link | 3B | 30.4 | | | |
| causal | facebook/opt-iml-max-1.3b | Link | 1B | 27.5 | | | 1.8 |
| causal | OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | Link | 12B | 27.0 | 30.0 | | 9.1 |
| causal | stabilityai/stablelm-base-alpha-7b | Link | 7B | 26.2 | | | 1.8 |
| causal | databricks/dolly-v2-12b | Link | 12B | 25.7 | | | 7.9 |
| causal | Salesforce/codegen-6B-mono | Link | 6B | | | | 27.4 |
| causal | togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1 | Link | 7B | 38.1 | 31.3 | 24.7 | 5.5 |

Example Usage

Evaluate on Massive Multitask Language Understanding (MMLU), which includes exam questions from 57 tasks such as mathematics, history, law, and medicine. We use 5-shot direct prompting and measure the exact-match score.

python main.py mmlu --model_name llama --model_path chavinlo/alpaca-native
# 0.4163936761145136

python main.py mmlu --model_name seq_to_seq --model_path google/flan-t5-xl 
# 0.49252243270189433
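
Here, 5-shot direct prompting means the model sees five solved exam questions before the test question and must output the answer letter directly, with no chain-of-thought. A sketch of such a prompt builder, assuming the standard MMLU template; the repo's exact template may differ:

# Hypothetical helper: builds a 5-shot MMLU-style prompt.
def build_mmlu_prompt(subject, dev_examples, question, choices):
    # dev_examples: list of (question, options, answer_letter) tuples
    prompt = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    for q, opts, ans in dev_examples[:5]:  # 5-shot
        prompt += q + "\n"
        for letter, opt in zip("ABCD", opts):
            prompt += f"{letter}. {opt}\n"
        prompt += f"Answer: {ans}\n\n"
    # Append the test question, leaving the answer for the model.
    prompt += question + "\n"
    for letter, opt in zip("ABCD", choices):
        prompt += f"{letter}. {opt}\n"
    prompt += "Answer:"
    return prompt

The model's generated answer is then compared against the gold answer letter.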

Evaluate on Big Bench Hard (BBH), which includes 23 challenging tasks on which PaLM (540B) performs below an average human rater. We use 3-shot direct prompting and measure the exact-match score.

python main.py bbh --model_name llama --model_path TheBloke/koala-13B-HF --load_8bit
# 0.3468942926723247
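
For reference, exact match means the generated answer must equal the gold answer after light normalization. A simplified illustration (the repo's actual normalization rules may differ):

# Hypothetical normalization: lowercase, trim, collapse whitespace.
def exact_match(prediction, target):
    norm = lambda s: " ".join(s.lower().strip().split())
    return norm(prediction) == norm(target)

preds, targets = ["(A) ", "no"], ["(a)", "yes"]
score = sum(exact_match(p, t) for p, t in zip(preds, targets)) / len(targets)
print(score)  # 0.5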

Evaluate on DROP, which is a math question-answering benchmark. We use 3-shot direct prompting and measure the exact-match score.

python main.py drop --model_name seq_to_seq --model_path google/flan-t5-xl 
# 0.5632458233890215

Evaluate on HumanEval, which includes 164 Python coding questions. We use 0-shot direct prompting and measure the pass@1 score.

python main.py humaneval  --model_name llama --model_path eachadea/vicuna-13b --n_sample 1 --load_8bit
# {'pass@1': 0.1524390243902439}
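
The pass@1 score is the fraction of problems for which a generated sample passes all unit tests. With n samples per problem, pass@k is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021); a minimal sketch:

from math import comb

def pass_at_k(n, c, k):
    # n: samples generated per problem, c: samples that passed the
    # unit tests, k: evaluation budget. Returns the probability that
    # at least one of k samples drawn from the n is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With --n_sample 1 (n == k == 1), this reduces to the plain
# fraction of problems whose single sample passes.
print(pass_at_k(n=10, c=5, k=1))  # 0.5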

Setup

Install dependencies and download data.

conda create -n instruct-eval python=3.8 -y
conda activate instruct-eval
pip install -r requirements.txt
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
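
If the setup worked, a quick smoke test on a small checkpoint should run end to end (assuming the seq_to_seq loader accepts other FLAN-T5 sizes, as it does the xl model above; expect a lower score than the larger models):

python main.py mmlu --model_name seq_to_seq --model_path google/flan-t5-base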
