Tokenization for Hindi (e.g. `क्या`) is weird

hplt-project commented on May 18, 2024

Tokenization for Hindi (e.g. `क्या`) is weird

Comments (6)

alvations commented on May 18, 2024

Expected behavior for zh and ko:

$ echo "记者 应谦 美国" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l zh
Tokenizer Version 1.1
Language: zh
Number of threads: 1
记者 应谦 美国

$ echo "세계 에서 가장 강력한" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ko
Tokenizer Version 1.1
Language: ko
Number of threads: 1
WARNING: No known abbreviations for language 'ko', attempting fall-back to English version...
세계 에서 가장 강력한

alvations commented on May 18, 2024

@johnfarina Thanks for spotting that! The latest PR, #60, should resolve the CJK issues.

The Hindi one is a little more complicated, so I'm leaving this issue open.

pip install -U "sacremoses>=0.0.22"
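
One plausible reason the Hindi case is harder, offered as an assumption and not as a diagnosis confirmed in this thread: `क्या` contains combining marks, and Python's `str.isalnum` does not count those as alphanumeric, so a "pad everything that isn't alphanumeric" rule would pull the word apart. A minimal standard-library sketch for inspecting the code points (`describe` is a hypothetical helper, not part of sacremoses):

```python
import unicodedata

def describe(word):
    """Return (code point, Unicode category, str.isalnum result) per character."""
    return [
        ("U+{:04X}".format(ord(ch)), unicodedata.category(ch), ch.isalnum())
        for ch in word
    ]

# The word from the issue title, spelled out as escapes: KA, VIRAMA, YA, VOWEL SIGN AA.
word = "\u0915\u094d\u092f\u093e"

for code_point, category, is_alnum in describe(word):
    print(code_point, category, is_alnum)
```

The virama (category `Mn`) and the vowel sign (category `Mc`) both report `False` for `isalnum`, while the two consonants (category `Lo`) report `True`. Whether this is what actually trips sacremoses would need to be checked against its character tables.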

johnfarina commented on May 18, 2024

Oh wow, comment on a GitHub issue, go to bed, wake up, bug is fixed! Thanks so much @alvations !!

johnfarina commented on May 18, 2024

The same is true for Chinese and Korean. sacremoses splits every character:

Here's some Chinese:

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记', '者', '应', '谦', '美', '国']

And some Korean:

>>> mt = MosesTokenizer(lang='ko')
>>> mt.tokenize("세계 에서 가장 강력한")
['세', '계', '에', '서', '가', '장', '강', '력', '한']

Which is a shame, as I'd really like to use sacremoses as the tokenizer with LASER instead of using subprocess and temp files to call the Moses Perl scripts.

alvations commented on May 18, 2024

Looks like the unichars list and the perluniprops list of alphanumeric characters are a little different.

The issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420 where the non-alphanumeric characters are padded with spaces.

It looks like \p{IsAlnum} includes the CJK characters:

$ echo "记者 应谦 美国" | sed "s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g"
记者 应谦 美国

But when we check unichars, it's missing:

$ unichars '\p{Alnum}' | cut -f2 -d' ' | grep "记"

Using the unichars -au option works:

$ unichars -au '\p{Alnum}' | cut -f2 -d' ' | grep "记"
记

Note: see https://webcache.googleusercontent.com/search?q=cache:bmLqeEnWJa0J:https://codeday.me/en/qa/20190306/8531.html+&cd=6&hl=en&ct=clnk&gl=sg
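
The effect of an incomplete alphanumeric table can be reproduced with a short, self-contained Python sketch. This is not the sacremoses code: `pad_non_alnum`, `ascii_only`, and `KEPT_PUNCT` are hypothetical names chosen to mirror the regex above, and Python's built-in `str.isalnum` stands in for a table that does cover CJK.

```python
# Punctuation the regex above leaves untouched, besides alphanumerics and whitespace.
KEPT_PUNCT = set(".'`,-")

def pad_non_alnum(text, is_alnum):
    """Pad each character that is not alphanumeric (according to `is_alnum`),
    whitespace, or kept punctuation with spaces, then split on whitespace."""
    padded = []
    for ch in text:
        if is_alnum(ch) or ch.isspace() or ch in KEPT_PUNCT:
            padded.append(ch)
        else:
            padded.append(" " + ch + " ")
    return "".join(padded).split()

def ascii_only(ch):
    """Stand-in for an alphanumeric table that is missing CJK."""
    return ch.isascii() and ch.isalnum()

text = "记者 应谦 美国"
print(pad_non_alnum(text, ascii_only))   # each CJK character becomes its own token
print(pad_non_alnum(text, str.isalnum))  # the three words stay intact
```

With the ASCII-only predicate, every CJK character is treated as non-alphanumeric and padded, which matches the per-character split reported earlier in the thread. With a predicate that recognizes CJK, the input passes through unchanged.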

mtresearcher commented on May 18, 2024

@alvations any update on the Hindi tokenization issue?
