After the import "from sacremoses import MosesTokenizer" the following error occurs:

Error when trying to import Tokenizer (sacremoses issue, 11 comments, closed)

hplt-project commented on May 18, 2024

Comments (11)

lukedorney commented on May 18, 2024

A quick workaround is to state the encoding explicitly as UTF-8 in corpus.py, line 37, i.e. change
with open(self.datadir+category+'.txt') as fin:
to
with open(self.datadir+category+'.txt', encoding='utf-8') as fin:
Note that this only works with Python 3 (with Python 2 you'd need to use codecs).
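The workaround above can be sketched in a form that runs on both Python 2 and 3. This is only an illustration, not the actual sacremoses code: `read_lines` and the temporary sample file are hypothetical, and `io.open` is used here in place of `codecs` because it accepts `encoding=` on both major versions.

```python
import io
import os
import tempfile


def read_lines(path):
    """Read a text file as UTF-8 regardless of the platform's locale encoding."""
    # io.open is the same function as the built-in open() on Python 3,
    # and it also accepts encoding= on Python 2.
    with io.open(path, encoding="utf-8") as fin:
        return [line.strip() for line in fin]


# Hypothetical sample file containing non-ASCII characters.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.txt")
with io.open(path, "w", encoding="utf-8") as fout:
    fout.write(u"\u00e9\n\u4e2d\n")

lines = read_lines(path)
print(len(lines))
```

Passing the encoding explicitly means the result no longer depends on whether the machine's locale is cp1252 or UTF-8.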

alvations commented on May 18, 2024

Interesting, Windows is reading it as cp1252 instead of utf8. Are you using Python 3 or 2?

janwendt commented on May 18, 2024

3.6

alvations commented on May 18, 2024

Very interesting now!

Looks like an upstream bug/feature in CPython vs Windows... https://stackoverflow.com/questions/42070668/python-3-default-encoding-cp1252

My suggestion is to set the proper locale before starting the Python interpreter.

If you're on cygwin: https://stackoverflow.com/questions/24255407/permanently-set-python-path-for-anaconda-within-cygwin

If natively and globally on windows, see https://www.java.com/en/download/help/locale.xml
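As an aside, a different technique from the locale change suggested above is Python's UTF-8 mode, enabled with the `PYTHONUTF8=1` environment variable. It only exists on Python 3.7 and later, so it would not help on the 3.6 installations mentioned in this thread. The sketch below assumes a 3.7+ interpreter, and `preferred_encoding_in_child` is a hypothetical helper that starts a child interpreter and reports the encoding `open()` would use by default.

```python
import os
import subprocess
import sys


def preferred_encoding_in_child(extra_env):
    """Start a child interpreter with extra environment variables and
    return the encoding it would use by default for text files."""
    env = dict(os.environ)
    env.update(extra_env)
    out = subprocess.check_output(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
    )
    return out.decode("ascii").strip()


# With UTF-8 mode on, the locale's code page (e.g. cp1252) is ignored.
utf8_mode_encoding = preferred_encoding_in_child({"PYTHONUTF8": "1"})
print(utf8_mode_encoding)
```

The variable has to be set before the interpreter starts, which is why the check runs in a child process.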

alvations commented on May 18, 2024

BTW, do you get the same when you import nltk?

sleighsoft commented on May 18, 2024

I have the same issue, but not when importing nltk.

alvations commented on May 18, 2024

Are you using Windows too?

sleighsoft commented on May 18, 2024

Yes, Windows 10. Python 3.6.5

alvations commented on May 18, 2024

@lukedorney Hmmm.. It's weird that in Python3 the default encoding is already utf8 but Windows is doing something strange in the locale such that it's not the default.
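For anyone puzzled by the same thing, the two defaults can be compared directly. As far as I know, the UTF-8 default in Python 3 applies to `str`/`bytes` conversions, while `open()` without an `encoding` argument falls back to the locale's preferred encoding, which is typically cp1252 on a Western-European Windows setup. A minimal check (the variable names are just for illustration):

```python
import locale
import sys

# Default used by str.encode() / bytes.decode(): always 'utf-8' on Python 3.
str_default = sys.getdefaultencoding()

# Default used by open() when no encoding= is given: depends on the OS locale.
file_default = locale.getpreferredencoding(False)

print("str default: ", str_default)
print("file default:", file_default)
```

If the two values differ, files opened without an explicit encoding will be decoded with the second one.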

alvations commented on May 18, 2024

@lukedorney @janwendt @sleighsoft I've added the patch and updated the package.

Please tell me if you still face the same problems after

pip install -U sacremoses

alvations commented on May 18, 2024

Going to close this issue. If there's any error in Windows from encoding problems again, please feel free to reopen this issue.
