multiq's Introduction

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

arXiv preprint

Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are intended for use in English only (e.g. Llama2, Mistral) or in a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.

This is a joint work by Carolin Holtermann, Paul Röttger, Timm Dill and Anne Lauscher. For further details feel free to check out our paper.

MultiQ on Huggingface

Our silver standard benchmark is available on Huggingface: see here
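
For a quick first look at the data, the benchmark can be loaded with the datasets library. The snippet below is only a sketch: the dataset ID is a placeholder for illustration, so replace it with the actual repository ID from the Hugging Face link above.

from datasets import load_dataset

# Placeholder dataset ID for illustration only; use the actual MultiQ dataset ID linked above.
multiq = load_dataset("your-namespace/MultiQ")

# Inspect the available splits and columns.
print(multiq)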

Getting Started

We conducted all our experiments with Python 3.10. Before getting started, make sure you install the requirements listed in the requirements.txt file.

pip install -r requirements.txt

Using GlotLID

For our experiments, we used the model_v2.bin configuration of the publicly available GlotLID model (see here). However, since the authors are continuously updating the model and covering more languages in newer versions, it is worth checking the GlotLID repository for the latest updates.
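
A minimal usage sketch, assuming the fasttext and huggingface_hub packages are installed (e.g. via requirements.txt):

from huggingface_hub import hf_hub_download
import fasttext

# Download the pinned v2 model from the GlotLID repository on Hugging Face.
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model_v2.bin", cache_dir=None)
model = fasttext.load_model(model_path)

# Predict the most likely language label for a single line of text.
labels, scores = model.predict("Wie heißt die Hauptstadt von Frankreich?", k=1)
print(labels[0], scores[0])  # e.g. __label__deu_Latn with a high confidence score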

Repository Description

This repository contains all code and data needed to reproduce the experiments and results reported in our paper. All data files can be found in the data folder, while all relevant code files can be found in the src folder, both with corresponding README files.


License

This project is licensed under the CC-BY-4.0 License; see the LICENSE.md file for details.


multiq's Issues

GlotLID

Hi, thanks for using GlotLID in your project.

Based on your feedback in the paper and the way you used GlotLID, we improved GlotLID into version 3.

For the reproducibility of your results, I want to ask you to change model.bin in your code to model_v2.bin. This ensures that the version you used to obtain your results is downloaded, and you won't need to reproduce the results again (model.bin always refers to the latest model). I think only detection_GlotLID.py#L113 needs to be changed for the sake of reproducibility.

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model_v2.bin", cache_dir=None)

Version 3, based on your feedback, adds both Meiteilon (Manipuri) and Dogri, and also ensures coverage of all other Indian languages, even in transliteration. You can see the full list of languages for v3 here: https://github.com/cisnlp/GlotLID/blob/main/languages-v3.md

Also, I've seen in your code that managing ISO codes seems to be difficult. In the v3 design, we decided to make labels more mutually exclusive. For this reason, some "macro" languages for which we already cover a good variety of "individual" languages have been removed. Additionally, if two labels are very close and cause predictions to fluctuate, we decided to merge them or delete one of them.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.