mlmm-evaluation's Introduction

Evaluation Framework for Multilingual Large Language Models

Overview

This repo contains benchmark datasets and evaluation scripts for Multilingual Large Language Models (LLMs). The datasets can be used to evaluate models across 26 different languages and encompass three distinct tasks: ARC, HellaSwag, and MMLU. They are released as part of our Okapi framework for multilingual instruction-tuned LLMs with reinforcement learning from human feedback.

  • ARC: A dataset with 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering.
  • HellaSwag: A dataset for studying grounded commonsense inference. It consists of 70k multiple-choice questions about grounded situations: each question comes from one of two domains (ActivityNet or WikiHow) and has four answer choices about what might happen next in the scene. The correct answer is the (real) sentence describing the next event; the three incorrect answers are adversarially generated and human-verified, so as to fool machines but not humans.
  • MMLU: A dataset of multiple-choice questions drawn from diverse fields of knowledge. The test covers subjects in the humanities, the social sciences, the hard sciences, and other areas that are essential for some people to learn.
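All three benchmarks share the same multiple-choice setup: a question or context with four candidate answers, exactly one of which is correct. The snippet below is a purely illustrative Python sketch of such a record; the field names and the example question are hypothetical and do not reflect the exact schema of the released files.

# Illustrative multiple-choice record (hypothetical field names and content,
# not the actual schema of the released data files).
example = {
    "question": "Which instrument is used to measure air pressure?",
    "choices": ["barometer", "thermometer", "odometer", "voltmeter"],
    "answer": 0,  # index of the correct choice
}

# Models are typically scored by whether the answer option they assign the
# highest likelihood to matches the labelled answer.
predicted_index = 0
print("correct" if predicted_index == example["answer"] else "incorrect")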

Currently, our datasets support 26 languages: Russian, German, Chinese, French, Spanish, Italian, Dutch, Vietnamese, Indonesian, Arabic, Hungarian, Romanian, Danish, Slovak, Ukrainian, Catalan, Serbian, Croatian, Hindi, Bengali, Tamil, Nepali, Malayalam, Marathi, Telugu, and Kannada.

These datasets were translated from the original English ARC, HellaSwag, and MMLU datasets using ChatGPT. Our Okapi technical paper, which describes the datasets along with evaluation results for several multilingual LLMs (e.g., BLOOM, LLaMA, and our Okapi models), can be found here.

Usage and License Notices: Our evaluation framework is intended and licensed for research use only. The datasets are released under CC BY-NC 4.0 (allowing only non-commercial use) and should not be used outside of research purposes.

Install

To install lm-eval from the main branch of our repository, run:

git clone https://github.com/nlp-uoregon/mlmm-evaluation.git
cd mlmm-evaluation
pip install -e ".[multilingual]"

Basic Usage

First, download the multilingual evaluation datasets with the following script:

bash scripts/download.sh
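Once the script finishes, it can be useful to confirm that the data actually landed on disk before starting a run. The sketch below assumes the script writes the translated datasets under a local datasets/ directory; that path is an assumption, so adjust it to whatever scripts/download.sh reports.

# Sanity-check the downloaded data. The "datasets" directory name is an
# assumption; adjust it to match what scripts/download.sh actually creates.
from pathlib import Path

data_dir = Path("datasets")
if not data_dir.exists():
    raise SystemExit("No datasets/ directory found; re-run scripts/download.sh")

for sub in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    n_files = sum(1 for f in sub.rglob("*") if f.is_file())
    print(f"{sub.name}: {n_files} files")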

To evaluate your model on the three tasks, use the following script:

bash scripts/run.sh [LANG] [YOUR-MODEL-PATH]

For instance, if you want to evaluate our Okapi Vietnamese model, you could run:

bash scripts/run.sh vi uonlp/okapi-vi-bloom
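Before kicking off a full evaluation, it can save time to check that the model path resolves to a loadable Hugging Face checkpoint. This is an optional sketch using the transformers library directly (it is not part of the repository's scripts); uonlp/okapi-vi-bloom is simply the model from the example above.

# Optional pre-flight check (not part of the repo's scripts): make sure the
# model path resolves to a valid Hugging Face config and tokenizer.
from transformers import AutoConfig, AutoTokenizer

model_path = "uonlp/okapi-vi-bloom"  # or a local path to your own model
config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
print(type(config).__name__, "| vocab size:", tokenizer.vocab_size)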

Leaderboard

We maintain a leaderboard for tracking the progress of multilingual LLMs.

Acknowledgements

Our framework is largely inherited from EleutherAI's lm-evaluation-harness repo. Please also kindly cite their repo if you use this code.

Citation

If you use the data, model, or code in this repository, please cite:

@article{dac2023okapi,
  title={Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback},
  author={Dac Lai, Viet and Van Nguyen, Chien and Ngo, Nghia Trung and Nguyen, Thuat and Dernoncourt, Franck and Rossi, Ryan A and Nguyen, Thien Huu},
  journal={arXiv e-prints},
  pages={arXiv--2307},
  year={2023}
}

mlmm-evaluation's People

Contributors

anoperson, chiennv2000, laiviet, usernameisintentionallyhidden


mlmm-evaluation's Issues

Got stuck when evaluating MMLU

Thanks for open-sourcing this! I'm trying to evaluate Llama-7b-hf on mmlu-fr. A warning, "Token indices sequence length is longer than the specified maximum sequence length for this model (5023 > 4096). Running this sequence through the model will result in indexing errors", occurs and the process seems to be stuck. Here is the call stack after a keyboard interrupt:

Token indices sequence length is longer than the specified maximum sequence length for this model (5023 > 4096). Running this sequence through the model will result in indexing errors
^CTraceback (most recent call last):
  File "/data2/zl/code/mlmm-evaluation/main.py", line 135, in <module>
    main()
  File "/data2/zl/code/mlmm-evaluation/main.py", line 108, in main
    results = evaluator.open_llm_evaluate(
  File "/data2/zl/code/mlmm-evaluation/lm_eval/utils.py", line 205, in _wrapper
    return fn(*args, **kwargs)
  File "/data2/zl/code/mlmm-evaluation/lm_eval/evaluator.py", line 79, in open_llm_evaluate
    results = evaluate(
  File "/data2/zl/code/mlmm-evaluation/lm_eval/utils.py", line 205, in _wrapper
    return fn(*args, **kwargs)
  File "/data2/zl/code/mlmm-evaluation/lm_eval/evaluator.py", line 262, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/data2/zl/code/mlmm-evaluation/lm_eval/base.py", line 181, in loglikelihood
    context_enc = self.tok_encode(context)
  File "/data2/zl/code/mlmm-evaluation/lm_eval/models/huggingface.py", line 361, in tok_encode
    return self.tokenizer.encode(string, add_special_tokens=self.add_special_tokens)
  File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2569, in encode
    encoded_inputs = self.encode_plus(
  File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2977, in encode_plus
    return self._encode_plus(
  File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
KeyboardInterrupt

It seems the process is stuck in batched tokenization. How can I deal with this?
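One generic way to avoid the over-length warning, independent of this repository's code, is to truncate prompts to the model's maximum length before encoding. A minimal sketch with a Hugging Face tokenizer follows; the model name is a placeholder and the length limit is taken from the warning above.

# Generic workaround sketch (not the repository's own fix): truncate prompts
# that exceed the model's maximum sequence length before encoding them.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder model
max_len = 4096  # limit quoted in the warning above

long_prompt = "question and few-shot examples ... " * 1000  # stand-in for an over-long prompt
ids = tokenizer.encode(long_prompt, truncation=True, max_length=max_len)
print(len(ids))  # at most 4096, so no indexing errors downstream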

ARC-Easy Dataset

Dear authors,

Thanks for your nice work. I am wondering whether you also translated the ARC-Easy dataset, since the bash download script currently only yields the ARC-Challenge dataset.

I would really appreciate it if you could release that part as well. Thanks!

Best,
Feng

Doesn't work with any HF model

Hello,

I've been trying with different LLMs but I haven't been able to make it work. Could you shed some light on this?

luispoveda93@LUIS-PC:~/mlmm-evaluation$  bash scripts/run.sh es microsoft/Phi-3-mini-4k-instruct
Selected Tasks: ['arc_es', 'hellaswag_es', 'mmlu_es']
config.json: 100%|█████████████████████████████████████████████████████████████████████| 904/904 [00:00<00:00, 9.41MB/s]
Traceback (most recent call last):
  File "/home/luispoveda93/mlmm-evaluation/main.py", line 135, in <module>
    main()
  File "/home/luispoveda93/mlmm-evaluation/main.py", line 108, in main
    results = evaluator.open_llm_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luispoveda93/mlmm-evaluation/lm_eval/utils.py", line 205, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/luispoveda93/mlmm-evaluation/lm_eval/evaluator.py", line 66, in open_llm_evaluate
    lm = lm_eval.models.get_model(model).create_from_arg_string(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luispoveda93/mlmm-evaluation/lm_eval/base.py", line 116, in create_from_arg_string
    return cls(**args, **args2)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/luispoveda93/mlmm-evaluation/lm_eval/models/huggingface.py", line 169, in __init__
    self._config = self.AUTO_CONFIG_CLASS.from_pretrained(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luispoveda93/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 931, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luispoveda93/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 627, in resolve_trust_remote_code
    raise ValueError(
ValueError: Loading microsoft/Phi-3-mini-4k-instruct requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
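The error message itself points at the fix: the configuration has to be loaded with trust_remote_code=True (after reviewing the remote code in the model repository). Whether the repository's run script exposes that option is not clear from this issue; as a standalone illustration with the transformers API, the flag looks like this:

# Standalone illustration of the flag the error asks for (this is a plain
# transformers call, not a change to the repository's code). Only set
# trust_remote_code=True after reviewing the code in the model repo.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
)
print(config.model_type)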

Ollama-installed models

Hello,
I've been trying to run the framework with a model I installed via Ollama, but I haven't been able to get it working. Maybe it's related to the model path, but I'm not sure. Have you tried models installed with Ollama?

Need to submit results

Hi,
Can I submit results for each language one by one, or do I have to submit them all together?
Thanks

Please add support for Adapter Models

The scripts look for config.json in the HF repo. But for fine-tuned / adapter models, that file is adapter_config.json, and I might also need to supply the adapter weights as well.
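One common workaround, assuming the adapter was trained with the PEFT library, is to merge the adapter into its base model and save the merged checkpoint, which then has a regular config.json the scripts can find. A minimal sketch (all paths are placeholders):

# Workaround sketch for PEFT/LoRA adapters (all paths are placeholders):
# merge the adapter into its base model so the saved checkpoint ships a
# standard config.json.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/adapter")
merged = model.merge_and_unload()  # bake the adapter weights into the base model

merged.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("path/to/base-model").save_pretrained("path/to/merged-model")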
