Comments (4)
I'm sure the difference is large for Chinese. Before the multilingual question answering datasets were released, I ran some zero-shot reading comprehension experiments on DRCD (Chinese) and KorQuAD (Korean); the results are reported in https://arxiv.org/pdf/1909.09587.pdf.
With the evaluation script modified as here, I get {"exact": 66.71396140749148, "f1": 78.41471541616556} on DRCD. Without the modification, I get {"exact": 66.71396140749148, "f1": 66.71396140749148} on DRCD.
from xtreme.
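The gap comes from how the F1 metric tokenizes answers: the standard SQuAD-style script splits on whitespace, which treats an entire Chinese answer as a single token, so any partial overlap scores 0 and F1 collapses to exact match. The MLQA-style fix segments CJK characters individually. A minimal sketch of the effect (this is a simplified stand-in, not the actual MLQA evaluation code; `segment` is a hypothetical helper approximating its mixed segmentation):

```python
from collections import Counter

def segment(text):
    # Hypothetical simplified mixed segmentation: each CJK codepoint becomes
    # its own token, and everything else is split on whitespace.
    out, buf = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            if buf.strip():
                out.extend(buf.split())
            buf = ""
            out.append(ch)
        else:
            buf += ch
    if buf.strip():
        out.extend(buf.split())
    return out

def f1_score(prediction, ground_truth, mixed_segmentation=False):
    # SQuAD-style token-overlap F1. With plain whitespace splitting, a Chinese
    # answer is one indivisible token, so partial matches score 0.
    if mixed_segmentation:
        pred_toks, gold_toks = segment(prediction), segment(ground_truth)
    else:
        pred_toks, gold_toks = prediction.split(), ground_truth.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(f1_score("北京大学", "北京"))                           # 0.0 — whole string is one token
print(f1_score("北京大学", "北京", mixed_segmentation=True))  # 2/3 — 2 of 4 chars match, recall 1.0
```

This is why exact match is unchanged by the script modification while F1 moves substantially for non-whitespace-delimited languages.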
Scores before vs. after changing the script on XQuAD (exact_match is identical in both cases, since the change only affects F1 tokenization; values rounded to two decimals):

| lang | exact_match | f1 (before) | f1 (after) |
|------|-------------|-------------|------------|
| en   | 69.16       | 81.33       | 81.13      |
| es   | 50.84       | 71.92       | 70.39      |
| de   | 49.16       | 66.41       | 65.51      |
| el   | 31.43       | 47.04       | 56.92      |
| ru   | 51.34       | 68.78       | 73.59      |
| tr   | 29.83       | 45.23       | 47.69      |
| ar   | 43.95       | 60.51       | 69.68      |
| vi   | 13.19       | 31.34       | 38.22      |
| th   | 18.66       | 27.48       | 41.45      |
| zh   | 48.91       | 58.35       | 66.20      |
| hi   | 26.39       | 41.86       | 52.01      |
Thanks for the note. I had experimented with using the MLQA evaluation script for XQuAD but only observed marginal differences in some experiments (as mentioned here). If the differences are indeed larger, we might consider updating the evaluation script. What model did you use to obtain the scores?
Hi @sebastianruder, I used bert-base-multilingual-cased.
However, as I mentioned in #8 (comment), there is a bug in scripts/*_qa.sh: in the results above, I did not remove the --do_lower_case argument.
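Passing --do_lower_case to a *cased* model is a bug because the cased vocabulary has no entries for most lowercased forms, so input tokens degrade toward [UNK]. A toy illustration (the vocabulary and `tokenize` function here are hypothetical stand-ins, not the real mBERT tokenizer):

```python
# Toy stand-in for a cased vocabulary: it contains "Berlin" but not "berlin".
cased_vocab = {"Berlin", "Korea", "the", "in"}

def tokenize(text, vocab, do_lower_case=False):
    # Simplified whole-word lookup; real tokenizers fall back to subwords,
    # but the mismatch between lowercased input and a cased vocab is the same.
    if do_lower_case:
        text = text.lower()
    return [tok if tok in vocab else "[UNK]" for tok in text.split()]

print(tokenize("Berlin in Korea", cased_vocab))                      # ['Berlin', 'in', 'Korea']
print(tokenize("Berlin in Korea", cased_vocab, do_lower_case=True))  # ['[UNK]', 'in', '[UNK]']
```

So results obtained with the flag left in place may understate what bert-base-multilingual-cased can actually do.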