multiq's Introduction

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

arXiv preprint

Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are intended for use in English only (e.g. Llama2, Mistral) or in a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.

This is a joint work by Carolin Holtermann, Paul Röttger, Timm Dill and Anne Lauscher. For further details feel free to check out our paper.

MultiQ on Huggingface

Our silver standard benchmark is available on Huggingface: see here
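
For a quick first look at the data, the benchmark can be loaded with the datasets library. The snippet below is only a sketch: the dataset ID is a placeholder for illustration, so replace it with the actual repository ID from the Hugging Face link above.

from datasets import load_dataset

# Placeholder dataset ID for illustration only; use the actual MultiQ dataset ID linked above.
multiq = load_dataset("your-namespace/MultiQ")

# Inspect the available splits and columns.
print(multiq)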

Getting Started

We conducted all our experiments with Python 3.10. Before getting started, make sure you install the requirements listed in the requirements.txt file.

pip install -r requirements.txt

Using GlotLID

For our experiments, we used the model_v2.bin configuration of the publicly available GlotLID model (see here). However, since the authors are continuously updating the model and covering more languages in newer versions, it is worth checking the GlotLID repository for the latest updates.
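
A minimal usage sketch, assuming the fasttext and huggingface_hub packages are installed (e.g. via requirements.txt):

from huggingface_hub import hf_hub_download
import fasttext

# Download the pinned v2 model from the GlotLID repository on Hugging Face.
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model_v2.bin", cache_dir=None)
model = fasttext.load_model(model_path)

# Predict the most likely language label for a single line of text.
labels, scores = model.predict("Wie heißt die Hauptstadt von Frankreich?", k=1)
print(labels[0], scores[0])  # e.g. __label__deu_Latn with a high confidence score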

Repository Description

This repository contains all code and data needed to reproduce the experiments and results reported in our paper. All data files can be found in the data folder, while all relevant code files can be found in the src folder, both with corresponding README files.


License

This project is licensed under the CC-BY-4.0 License; see the LICENSE.md file for details.


multiq's Issues

GlotLID

Hi, thanks for using GlotLID in your project.

Based on your feedback in the paper and the way you used GlotLID, we improved GlotLID into version 3.

For the reproducibility of your results, I want to ask you to change model.bin in your code to model_v2.bin. This ensures that the version you used to obtain your results is downloaded, and you won't need to reproduce the results again (model.bin always refers to the latest model). I think only detection_GlotLID.py#L113 needs to be changed for the sake of reproducibility.

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model_v2.bin", cache_dir=None)

Version 3, based on your feedback, adds both Meiteilon (Manipuri) and Dogri, and also ensures coverage of all other Indian languages, even in transliteration. You can see the full list of languages for v3 here: https://github.com/cisnlp/GlotLID/blob/main/languages-v3.md

Also, I've seen in your code that managing ISO codes seems to be difficult. In the v3 design, we decided to make labels more mutually exclusive. For this reason, some "macro" languages for which we already cover a good variety of "individual" languages have been removed. Additionally, if two labels are very close and cause predictions to fluctuate, we decided to merge them or delete one of them.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.