Git Product home page Git Product logo

xquad's Introduction

XQuAD

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages.

For more information on how the dataset was created, refer to our paper, On the Cross-lingual Transferability of Monolingual Representations.

All files are in json format following the SQuAD dataset format. A parallel example in XQuAD in English, Spanish, and Chinese can be seen in the image below. The full dataset consists of 240 such parallel instances in 11 languages.

An example from XQuAD

Data

This directory contains files in the following languages:

  • Arabic: xquad.ar.json
  • German: xquad.de.json
  • Greek: xquad.el.json
  • English: xquad.en.json
  • Spanish: xquad.es.json
  • Hindi: xquad.hi.json
  • Russian: xquad.ru.json
  • Thai: xquad.th.json
  • Turkish: xquad.tr.json
  • Vietnamese: xquad.vi.json
  • Chinese: xquad.zh.json

As the dataset is based on SQuAD v1.1, there are no unanswerable questions in the data. We chose this setting so that models can focus on cross-lingual transfer.

We show the average number of tokens per paragraph, question, and answer for each language in the table below. The statistics were obtained using Jieba for Chinese and the Moses tokenizer for the other languages.

en es de el ru tr ar vi th zh hi
Paragraph 142.4 160.7 139.5 149.6 133.9 126.5 128.2 191.2 158.7 147.6 232.4
Question 11.5 13.4 11.0 11.7 10.0 9.8 10.7 14.8 11.5 10.5 18.7
Answer 3.1 3.6 3.0 3.3 3.1 3.1 3.1 4.5 4.1 3.5 5.6

Training and evaluation

In order to evaluate on XQuAD, models should be trained on the SQuAD v1.1 training file. which can be downloaded from here. Model validation similarly should be conducted on the SQuAD v1.1 validation file.

For evaluation, we use the official SQuAD evaluate-v1.1.py script, which can be obtained from here. Note that the SQuAD evaluation script normalises the answer based on heuristics that are specific to English. We have observed language-specific normalisation heuristics only to have a marginal impact on performance, which is why we use the English SQuAD v1.1 evaluation script for convenience.

Baselines

We show results using baseline methods in the tables below. We directly fine-tune mBERT and XLM-R Large on the English SQuAD v1.1 training data and evaluate them via zero-shot transfer on the XQuAD test datasets. For translate-train, we fine-tune mBERT on the SQuAD v1.1 training data, which we automatically translate to the target language. For translate-test, we fine-tune BERT-Large on the SQuAD v1.1 training set and evaluate it on the XQuAD test set of the target language, which we automatically translate to English. Note that results with translate-test are not directly comparable as we drop a small number (less than 3%) of the test examples.

F1 scores:

Model en ar de el es hi ru th tr vi zh avg
mBERT 83.5 61.5 70.6 62.6 75.5 59.2 71.3 42.7 55.4 69.5 58.0 64.5
XLM-R Large 86.5 68.6 80.4 79.8 82.0 76.7 80.1 74.2 75.9 79.1 59.3 76.6
Translate-train mBERT 83.5 68.0 75.6 70.0 80.2 69.6 75.0 36.9 68.9 75.6 66.2 70.0
Translate-test BERT-L 87.9 73.7 79.8 79.4 82.0 74.9 79.9 64.6 67.4 76.3 73.7 76.3

EM scores:

Model en ar de el es hi ru th tr vi zh avg
mBERT 72.2 45.1 54.0 44.9 56.9 46.0 53.3 33.5 40.1 49.6 48.3 49.4
XLM-R Large 75.7 49.0 63.4 61.7 63.9 59.7 64.3 62.8 59.3 59.0 50.0 60.8
Translate-train mBERT 72.2 51.1 60.7 53.0 63.1 55.4 59.7 33.5 54.8 56.2 56.6 56.0
Translate-test BERT-L 77.1 58.8 66.7 65.5 68.4 60.1 66.7 50.0 49.6 61.5 59.1 62.1

Best practices

XQuAD is intended as an evaluation corpus for zero-shot cross-lingual transfer. Evaluation on the test data should ideally only be conducted at the very end of the experimentation in order to avoid overfitting to the data.

If you are evaluating on XQuAD in the zero-shot setting, please state explicitly your experimental settings, particularly what monolingual and cross-lingual data you used for pre-training and fine-tuning your model.

Reference

If you use this dataset, please cite [1]:

[1] Artetxe, M., Ruder, S., & Yogatama, D. (2019). On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856.

@article{Artetxe:etal:2019,
      author    = {Mikel Artetxe and Sebastian Ruder and Dani Yogatama},
      title     = {On the cross-lingual transferability of monolingual representations},
      journal   = {CoRR},
      volume    = {abs/1910.11856},
      year      = {2019},
      archivePrefix = {arXiv},
      eprint    = {1910.11856}
}

License

This dataset is distributed under the CC BY-SA 4.0 license.

This is not an officially supported Google product.

xquad's People

Contributors

dyogatama avatar sebastianruder avatar derpson avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.