
sacremoses's Introduction

Sacremoses

License

MIT License.

Install

pip install -U sacremoses

NOTE: Sacremoses only supports Python 3 now (sacremoses>=0.0.41). If you're using Python 2, the last possible version is sacremoses==0.0.40.

Usage (Python)

Tokenizer and Detokenizer

>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='en')
>>> text = 'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
>>> expected_tokenized = 'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
>>> tokenized_text = mt.tokenize(text, return_str=True)
>>> tokenized_text == expected_tokenized
True


>>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = ['This', 'ain', '&apos;t', 'funny', '.', 'It', '&apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&#91;', '&#93;', '&amp;', 'You', '&apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '&apos;t', '?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> tokens = mt.tokenize(sent)
>>> tokens == expected_tokens
True
>>> md.detokenize(tokens) == expected_detokens
True

Truecaser

>>> from sacremoses import MosesTruecaser, MosesTokenizer

# Train a new truecaser from a 'big.txt' file.
>>> mtr = MosesTruecaser()
>>> mtok = MosesTokenizer(lang='en')

# Save the truecase model to 'big.truecasemodel' using `save_to`
>>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
>>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

# Save the truecase model to 'big.truecasemodel' after training
# (just in case you forgot to use `save_to`)
>>> mtr = MosesTruecaser()
>>> mtr.train_from_file('big.txt')
>>> mtr.save_model('big.truecasemodel')

# Truecase a string after training a model.
>>> mtr = MosesTruecaser()
>>> mtr.train_from_file('big.txt')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']

# Loads a model and truecase a string using trained model.
>>> mtr = MosesTruecaser('big.truecasemodel')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", use_known=True)
['the', 'ADVENTURES', 'OF', 'SHERLOCK', 'HOLMES']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
'the adventures of Sherlock Holmes'

Normalizer

>>> from sacremoses import MosesPunctNormalizer
>>> mpn = MosesPunctNormalizer()
>>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU “AS-IS.”')
'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'

Usage (CLI)

Since version 0.0.42, the CLI supports pipelining commands, so there are global options that must be set first, before the commands are called:

  • language
  • processes
  • encoding
  • quiet
$ pip install -U "sacremoses>=0.1"

$ sacremoses --help
Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

Pipeline

An example that chains the following commands:

  • normalize with -c option to remove control characters.
  • tokenize with -a option for aggressive dash split rules.
  • truecase with -a option to indicate that model is for ASR
    • if big.truemodel exists, load the model with -m option,
    • otherwise train a model and save it with -m option to big.truemodel file.
  • redirect the output to the big.txt.norm.tok.true file.
cat big.txt | sacremoses -l en -j 4 \
    normalize -c tokenize -a truecase -a -m big.truemodel \
    > big.txt.norm.tok.true
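
For reference, here is a rough Python equivalent of that pipeline (a sketch: it assumes big.truemodel is already trained, and it omits the -c control-character and -a ASR options, whose Python-side equivalents aren't shown in this README):

from sacremoses import MosesPunctNormalizer, MosesTokenizer, MosesTruecaser

mpn = MosesPunctNormalizer(lang='en')
mtok = MosesTokenizer(lang='en')
mtr = MosesTruecaser('big.truemodel')   # load a previously trained model

with open('big.txt') as fin, open('big.txt.norm.tok.true', 'w') as fout:
    for line in fin:
        norm = mpn.normalize(line)
        # aggressive_dash_splits mirrors the tokenizer's -a option
        tok = mtok.tokenize(norm, aggressive_dash_splits=True, return_str=True)
        fout.write(mtr.truecase(tok, return_str=True) + '\n')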

Tokenizer

$ sacremoses tokenize --help
Usage: sacremoses tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patterns to be protected in
                                 tokenisation.
  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.
  -h, --help                     Show this message and exit.


 $ sacremoses -l en -j 4 tokenize  < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s]

 $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
 $ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]

Detokenizer

$ sacremoses detokenize --help
Usage: sacremoses detokenize [OPTIONS]

Options:
  -x, --xml-unescape  Unescape special characters for XML.
  -h, --help          Show this message and exit.

 $ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]

Truecase

$ sacremoses truecase --help
Usage: sacremoses truecase [OPTIONS]

Options:
  -m, --modelfile TEXT            Filename to save/load the modelfile.
                                  [required]
  -a, --is-asr                    A flag to indicate that model is for ASR.
  -p, --possibly-use-first-token  Use the first token as part of truecase
                                  training.
  -h, --help                      Show this message and exit.

$ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]

Detruecase

$ sacremoses detruecase --help
Usage: sacremoses detruecase [OPTIONS]

Options:
  -j, --processes INTEGER  No. of processes.
  -a, --is-headline        Whether the lines are headlines.
  -e, --encoding TEXT      Specify encoding of file.
  -h, --help               Show this message and exit.

$ sacremoses -j 4 detruecase  < big.txt.tok.true > big.txt.tok.true.detrue
100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]

Normalize

$ sacremoses normalize --help
Usage: sacremoses normalize [OPTIONS]

Options:
  -q, --normalize-quote-commas  Normalize quotations and commas.
  -d, --normalize-numbers       Normalize numbers.
  -p, --replace-unicode-puncts  Replace Unicode punctuation BEFORE
                                normalization.
  -c, --remove-control-chars    Remove control characters AFTER normalization.
  -h, --help                    Show this message and exit.

$ sacremoses -j 4 normalize < big.txt > big.txt.norm
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]

sacremoses's People

Contributors

alvations, askender, blkserene, brandonherzog, erip, haukurpall, jelmervdl, myleott, nixblack11, pluiez, pypae, shijie-wu, thammegowda, yannvgn, yuyang-huang


sacremoses's Issues

Truecaser Known Case Tokens

If a word is not the first word of the sentence, and the word was seen with this exact casing in the training material, the original script does not recase the word.

i.e.

perl train-truecaser.perl --model big.model --corpus big.txt
echo "THE ADVENTURES OF SHERLOCK HOLMES" | perl truecase.perl --model big.model
the ADVENTURES OF SHERLOCK HOLMES
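
For reference, sacremoses exposes the same behaviour through the use_known flag, as shown in the Truecaser section above:

>>> mtr = MosesTruecaser('big.truecasemodel')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", use_known=True)
['the', 'ADVENTURES', 'OF', 'SHERLOCK', 'HOLMES']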

Deprecation warning due to invalid escape sequences in Python 3.8

Deprecation warnings are raised due to invalid escape sequences in Python 3.8. Below is a log of the warnings raised while compiling all the Python files. Using raw strings or escaping the backslashes will fix this issue.

find . -iname '*.py'  | xargs -P 4 -I{} python -Wall -m py_compile {}

./sacremoses/tokenize.py:41: DeprecationWarning: invalid escape sequence \s
  PAD_NOT_ISALNUM = u"([^{}\s\.'\`\,\-])".format(IsAlnum), r" \1 "
./sacremoses/tokenize.py:46: DeprecationWarning: invalid escape sequence \-
  u"([{alphanum}])\-(?=[{alphanum}])".format(alphanum=IsAlnum),
./sacremoses/tokenize.py:85: DeprecationWarning: invalid escape sequence \$
  SYMBOLS = u"([;:@#\$%&{}{}])".format(IsSc, IsSo), r" \1 "
./sacremoses/tokenize.py:91: DeprecationWarning: invalid escape sequence \/
  u"([{alphanum}])\/([{alphanum}])".format(alphanum=IsAlnum),
./sacremoses/tokenize.py:193: DeprecationWarning: invalid escape sequence \.
  TRAILING_DOT_APOSTROPHE = "\.' ?$", " . ' "
./sacremoses/tokenize.py:196: DeprecationWarning: invalid escape sequence \S
  BASIC_PROTECTED_PATTERN_2 = '<\S+( [a-zA-Z0-9]+\="?[^"]")+ ?\/?>'
./sacremoses/tokenize.py:197: DeprecationWarning: invalid escape sequence \S
  BASIC_PROTECTED_PATTERN_3 = "<\S+( [a-zA-Z0-9]+\='?[^']')+ ?\/?>"
./sacremoses/tokenize.py:198: DeprecationWarning: invalid escape sequence \w
  BASIC_PROTECTED_PATTERN_4 = "[\w\-\_\.]+\@([\w\-\_]+\.)+[a-zA-Z]{2,}"
./sacremoses/tokenize.py:199: DeprecationWarning: invalid escape sequence \/
  BASIC_PROTECTED_PATTERN_5 = "(http[s]?|ftp):\/\/[^:\/\s]+(\/\w+)*\/[\w\-\.]+"
./sacremoses/tokenize.py:325: DeprecationWarning: invalid escape sequence \s
  self.PAD_NOT_ISALNUM = u"([^{}\s\.'\`\,\-])".format(self.IsAlnum), r" \1 "
./sacremoses/tokenize.py:327: DeprecationWarning: invalid escape sequence \-
  u"([{alphanum}])\-(?=[{alphanum}])".format(alphanum=self.IsAlnum),
./sacremoses/tokenize.py:331: DeprecationWarning: invalid escape sequence \/
  u"([{alphanum}])\/([{alphanum}])".format(alphanum=self.IsAlnum),
./sacremoses/tokenize.py:699: DeprecationWarning: invalid escape sequence \(
  elif re.search(u"^[" + self.IsSc + u"\(\[\{\¿\¡]+$", token):
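
The fix is mechanical: prefix the affected string literals with r so the backslashes reach the regex engine unchanged. A sketch for the first warning (the compiled pattern stays identical):

# before: plain string, so \s is an invalid string escape under Python 3.8
PAD_NOT_ISALNUM = u"([^{}\s\.'\`\,\-])".format(IsAlnum), r" \1 "
# after: raw string, same regex, no DeprecationWarning
PAD_NOT_ISALNUM = r"([^{}\s\.'\`\,\-])".format(IsAlnum), r" \1 "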

Comma after a number at the end of the sentence not split

Commas after numbers at the end of a sentence are not split off, but they should be.
For example, in "Sicherung dieser Daten erstellen ( wie unter Abschnitt 4.1.1," the final comma should be split off.

The third comma rule of the original perl script (line 298) is missing from this implementation.

Fix:
On line 60 in tokenize.py

    COMMA_SEPARATE_3 = u'([{}])[,]$'.format(IsN), r'\1 , '

On line 349 add the rule:

        for regexp, substitution in [self.COMMA_SEPARATE_1, self.COMMA_SEPARATE_2, self.COMMA_SEPARATE_3]:
            text = re.sub(regexp, substitution, text)
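
The new rule can be sanity-checked in isolation; here [0-9] is a simplified stand-in for the IsN digit class used in tokenize.py:

>>> import re
>>> COMMA_SEPARATE_3 = r'([0-9])[,]$', r'\1 , '
>>> re.sub(*COMMA_SEPARATE_3, 'wie unter Abschnitt 4.1.1,')
'wie unter Abschnitt 4.1.1 , '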

pip3 install error: AttributeError: '_io.BufferedWriter' object has no attribute 'encoding'

ERROR: Exception:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/py_compile.py", line 143, in compile
    _optimize=optimize)
  File "<frozen importlib._bootstrap_external>", line 791, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/private/var/folders/g0/5zwy4mtx7579v5x6rxqb083r0000gn/T/pip-unpacked-wheel-l2i9ti3m/sacremoses/sent_tokenize.py", line 69
    if re.search(IS_EOS, token)
                              ^
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/compileall.py", line 159, in compile_file
    invalidation_mode=invalidation_mode)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/py_compile.py", line 147, in compile
    raise py_exc
py_compile.PyCompileError:   File "/private/var/folders/g0/5zwy4mtx7579v5x6rxqb083r0000gn/T/pip-unpacked-wheel-l2i9ti3m/sacremoses/sent_tokenize.py", line 69
    if re.search(IS_EOS, token)
                              ^
SyntaxError: invalid syntax


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 186, in _main
    status = self.run(options, args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 404, in run
    use_user_site=options.use_user_site,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 71, in install_given_reqs
    **kwargs
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 815, in install
    warn_script_location=warn_script_location,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip/_internal/operations/install/wheel.py", line 614, in install_wheel
    warn_script_location=warn_script_location,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip/_internal/operations/install/wheel.py", line 338, in install_unpacked_wheel
    compileall.compile_dir(source, force=True, quiet=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/compileall.py", line 97, in compile_dir
    legacy, optimize, invalidation_mode):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/compileall.py", line 169, in compile_file
    msg = err.msg.encode(sys.stdout.encoding,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 554, in encoding
    return self.orig_stream.encoding
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 409, in __getattr__
    return getattr(self.stream, name)
AttributeError: '_io.BufferedWriter' object has no attribute 'encoding'

Getting this error when doing pip3 install -U sacremoses and also when installing other packages which rely on this package.

Tag releases

setup.py says this is version 0.0.35, but there are no tags in this repository. Please create tags for your releases.

Issue in handles_nonbreaking_prefixes

There is an issue in handles_nonbreaking_prefixes, more specifically on line 271, where you check if the next token is all lowercase.

The original perl script just checks if the first character of the next token is lowercase. (as stated here: https://stackoverflow.com/questions/42126922/what-does-pre-pre-pisalpha-mean-in-the-moses-tokenizer)

So, for example, "4.5.1. dist-upgrade schlägt fehl mit" is tokenized to "4.5.1 . dist-upgrade schlägt fehl mit" because "dist-upgrade" is not all lowercase: the hyphen is not in the list of lowercase characters.
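
A minimal illustration of the two checks, with a-z as a simplified stand-in for the perluniprops IsLower class:

>>> import re
>>> token = 'dist-upgrade'
>>> bool(re.match('^[a-z]+$', token))  # sacremoses: the whole token must be lowercase
False
>>> bool(re.match('^[a-z]', token))    # original perl: only the first character
True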

Flag --protected from original Moses tokenizer

The original Moses tokenizer supports the --protected flag. Its effect is to accept a file with a list of regular expressions that should be protected from tokenization.

Under the hood, the tokenizer masks each match of the regexes, then tokenizes, then unmasks.

Is this functionality on the roadmap of sacremoses?

Add lowercase script?

Moses scripts included a useful lowercasing script. Are there any plans to add this?

Truecaser for foreign languages!

Can the train-truecase command be applied to any txt file containing foreign characters, i.e., for truecasing foreign languages such as German/Spanish?

# Some predefined words that will always be in lowercase.
self.ALWAYS_LOWER =

Or is this prevented by blocks like this in the code?

Error when trying to import Tokenizer

After the import "from sacremoses import MosesTokenizer" the following error occurs:

Traceback (most recent call last):
 File "moses.py", line 1, in <module>
   from sacremoses import MosesTokenizer
 File "C:\Anaconda3\lib\site-packages\sacremoses\__init__.py", line 3, in <module>
   from sacremoses.tokenize import *
 File "C:\Anaconda3\lib\site-packages\sacremoses\tokenize.py", line 16, in <module>
   class MosesTokenizer:
 File "C:\Anaconda3\lib\site-packages\sacremoses\tokenize.py", line 22, in MosesTokenizer
   IsN = text_type(''.join(perluniprops.chars('IsN')))
 File "C:\Anaconda3\lib\site-packages\sacremoses\corpus.py", line 38, in chars
   for ch in fin.read().strip():
 File "C:\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
   return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 65: character maps to <undefined>

Documentation

Now that we have more than tokenization, we need some proper documentation.

NameError: name 'words' is not defined in truecaser

Hi, thank you for your great work.

I was using sacremoses truecase and encountered following error:

  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/bin/sacremoses", line 11, in <module>
    sys.exit(cli())
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/sacremoses/cli.py", line 195, in truecase_file
    print(moses.truecase(line, return_str=True), end="\n", file=fout)
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/sacremoses/truecase.py", line 265, in truecase
    tokens = self.split_xml(text)
  File "/home/kiyono/.pyenv/versions/miniconda-3.9.1/envs/chainer4/lib/python3.6/site-packages/sacremoses/truecase.py", line 349, in split_xml
    and len(words) > 0
NameError: name 'words' is not defined

I have not looked into the details of the code, but it seems that the variable words is indeed not defined in the method.
Maybe it should be tokens instead?

Encoding problem when calling from a Python script on Windows

Hi, @alvations.
A student of mine and I are using the truecaser from a Python 3 script on Windows. The script makes sure that all files are opened as utf-8 by redefining open as follows:

open = functools.partial(open, encoding='utf8') 

However, when the script executes this excerpt of code:

	for suffix in [args.l1, args.l2] :
		mtc = MosesTruecaser()
		mtc.train([line.split() for line in open("train."+changes_applied+suffix)], save_to="truecasemodel."+suffix)
		for prefix in ["train.", "dev.", "test."] :
			with open(prefix+changes_applied+"true."+suffix,"w") as outfile :
				[outfile.write(mtc.truecase(line, return_str=True)+"\n") for line in open(prefix+changes_applied+suffix)]

The error occurs in the line

[outfile.write(mtc.truecase(line, return_str=True)+"\n") for line in open(prefix+changes_applied+suffix)]

And the traceback is

(prueba) C:\Users\Guest\nmt-for-translators\globalvoices\data>python ../../code/prepare.py GlobalVoices.es-fr fr es 130000 1500 1500 10000 --tokenize --truecase --bpe
Ficheros 'tokenizados'
Traceback (most recent call last):
  File "../../code/prepare.py", line 132, in <module>
    mtc.train([line.split() for line in open("train."+changes_applied+suffix)], save_to="truecasemodel."+suffix)
  File "C:\Users\Guest\AppData\Local\Programs\Python\Python36\lib\site-packages\sacremoses\truecase.py", line 142, in train
    self.model = self._train(documents, save_to, possibly_use_first_token, processes, progress_bar=progress_bar)
  File "C:\Users\Guest\AppData\Local\Programs\Python\Python36\lib\site-packages\sacremoses\truecase.py", line 132, in _train
    self._save_model_from_casing(casing, save_to)
  File "C:\Users\Guest\AppData\Local\Programs\Python\Python36\lib\site-packages\sacremoses\truecase.py", line 324, in _save_model_from_casing
    c
  File "C:\Users\Guest\prueba\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 4: character maps to <undefined>

We found that '\u011f' is the character 'ğ' in 'Erdoğan', position 4 in the word. Clearly, the error occurs when printing to a file:

print(' '.join(tokens_counts), end='\n', file=fout)

The file fout is opened as follows in truecase.py

with open(filename, 'w') as fout:

We have tried setting PYTHONIOENCODING before calling the script, to no avail.

Our python is:

Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

and it was installed from This page. It also fails with 3.6.3 on a different machine.
We are using it inside a virtualenv.
We don't seem to find a way to solve this without modifying your code.
Thanks a million for your help.
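
A plausible fix on the sacremoses side (an assumption, not a confirmed patch) is for truecase.py to pass an explicit encoding instead of relying on the platform default (cp1252 on Windows):

with open(filename, 'w', encoding='utf-8') as fout:  # explicit utf-8 instead of the locale default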

Apostrophes in English

I just reported the same issue to the mosestokenizer package: luismsgomes/mosestokenizer#1

The problem is that detokenization fails to handle apostrophes correctly:

import sacremoses                                                                                                                                     
tokens = 'yesterday ’s reception'.split(' ')                                                                                                          
print(sacremoses.MosesDetokenizer('en').detokenize(tokens))  

prints yesterday ’s reception
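
A possible workaround until this is fixed, assuming the punctuation normalizer maps the curly ’ to a plain ASCII apostrophe:

import sacremoses

mpn = sacremoses.MosesPunctNormalizer(lang='en')
md = sacremoses.MosesDetokenizer(lang='en')
tokens = 'yesterday ’s reception'.split(' ')
# normalize each token first so the English apostrophe rules can fire
print(md.detokenize([mpn.normalize(t) for t in tokens]))
# expected: yesterday's reception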

Add a license file and copyright infomation to the repository

Hi, I'm using the Python port of the Moses tokenizer in my project. I would appreciate it if you added a LICENSE file and copyright information to the repository.

And what's the official name for this Python port of the Moses tokenizer: sacremoses (the module name) or mosestokenizer (as in the title of the doc)?

Truecaser crashes for large corpora (>8M segments)

We used the truecaser for some of our corpora with >8M segments. There are some issues when training a truecaser for larger corpora:

  • Using joblib.Parallel causes a huge memory footprint even when used with a single process, i.e. >32GB of memory for our 8M corpus.
  • The training never seems to stop (cancelled after 24h); the progress bar finishes after about 20 minutes.

In our particular case we fixed the problem by using map instead of Parallel for single processes in this function:

https://github.com/alvations/sacremoses/blob/f3780b392368ba09106098354aca706f8476cdb6/sacremoses/util.py#L169-L171

def parallelize_preprocess(func, iterator, processes, progress_bar=False):
    iterator = tqdm(iterator) if progress_bar else iterator
    if processes <= 1:
        return map(func, iterator)
    return Parallel(n_jobs=processes)(delayed(func)(line) for line in iterator)

Weird results for Tamil and Russian tokenization

Hi, the results for Tamil tokenization are weird:

>>> import sacremoses
>>> TEXT_TAM = 'தமிழ் மொழி (Tamil language) தமிழர்களினதும், தமிழ் பேசும் பலரதும் தாய்மொழி ஆகும்.'
>>> EXPECTED_TOKENS = ['தமிழ்', 'மொழி', '(', 'Tamil', 'language', ')', 'தமிழர்களினதும்', ',', 'தமிழ்', 'பேசும்', 'பலரதும்', 'தாய்மொழி', 'ஆகும்', '.']

>>> t = sacremoses.MosesTokenizer(lang = 'ta')
>>> t.tokenize(TEXT_TAM)
['தமிழ', '்', 'மொழி', '(', 'Tamil', 'language', ')', 'தமிழர', '்', 'களினதும', '்', ',', 'தமிழ', '்', 'பேசும', '்', 'பலரதும', '்', 'தாய', '்', 'மொழி', 'ஆகும', '்', '.']
>>> assert t.tokenize(TEXT_TAM) == EXPECTED_TOKENS
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    assert t.tokenize(TEXT_TAM) == EXPECTED_TOKENS
AssertionError

Loading existing truecase models is broken

Observed counts should be cast to int for the _casing_to_model method to work:

casing[token.lower()][token] = int(count)

Otherwise the counter sorting is done lexically on strings, so a count with fewer digits can beat a larger number (e.g. "99" > "400" as strings), and the wrong casing is set as best.
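
The difference is easy to see in isolation:

>>> max(['99', '400'])  # lexical string comparison: '9' > '4'
'99'
>>> max([99, 400])      # numeric comparison after the int() cast
400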

Dropping Python 2.7 support.

Version 0.0.40 will be the last stable version for Python 2.7 users. This is to reduce the additional dev work needed to support regexes on a deprecated Python version, e.g. #84 and #94.

We encourage all users to use Python 3 with sacremoses henceforth. If any Python 2.7 users face issues with 0.0.40, please use this issue to track the problems; only critical issues will be fixed for Python 2.7 on 0.0.40.

New features from 0.0.41 onwards will not be backported.

How to extend BASIC_PROTECTED_PATTERNS

I want to use BASIC_PROTECTED_PATTERNS and extend it by adding custom patterns like:

text = "Don't let me find that"
lang = "en"

mpn = MosesPunctNormalizer(lang = lang)
mtok = MosesTokenizer(lang = lang)

patterns =  MosesTokenizer.BASIC_PROTECTED_PATTERNS
contractionRegex = re.compile(r"^[a-zA-Z]+\'t", flags=re.I | re.X | re.UNICODE)
patterns.append( contractionRegex )

# normalize
text =  mpn.normalize(text.lower())
# tokenize
tokens = mtok.tokenize(text, escape=False, return_str=False, aggressive_dash_splits=False, protected_patterns = patterns)

I'm not sure I'm doing the right thing. Here I would like to protect the pattern for all English contractions with an apostrophe, like ain't, don't, etc., in order to avoid tokenizing them the Treebank way

['don', '’', 't', 'let', 'me', 'find', 'that']

instead doing

['don't', 'let', 'me', 'find', 'that']
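
For what it's worth, here is a sketch that stays closer to how the tokenizer consumes the patterns: BASIC_PROTECTED_PATTERNS is a list of pattern strings (see the BASIC_PROTECTED_PATTERN_* definitions quoted elsewhere on this page), so appending a string rather than a compiled regex seems safer; the contraction pattern itself is only illustrative:

from sacremoses import MosesTokenizer

mtok = MosesTokenizer(lang='en')
# copy the class attribute so other instances are not affected
patterns = list(MosesTokenizer.BASIC_PROTECTED_PATTERNS)
patterns.append(r"[a-zA-Z]+'t")  # hypothetical pattern for -n't contractions

tokens = mtok.tokenize("Don't let me find that",
                       protected_patterns=patterns, escape=False)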

"p.m." is not tokenized as in the original script.

I could not yet figure out why, but in the original script the dot in p.m. at the end of a sentence is not split off, while with this port it is.

The original script even explicitly leaves p.m out of its nonbreaking prefixes, so I'd expect the behavior seen in the port.

Is is_lower() a typo for islower() in truecase.py?

When I try to call MosesTruecaser().train with possibly_use_first_token=True I get this error:
AttributeError: 'str' object has no attribute 'is_lower'
for line 103 in truecase.py which is:
if token[0].is_lower():

Either token[0] is not intended to be of type str, or is_lower() should be islower().

Support of Moses tokenizer Perl scripts

In my tokenization pipeline I run several Moses perl scripts like

from subprocess import check_output

def TokenLine(line, lang='en', lower_case=True, romanize=False):
    assert lower_case, 'lower case is needed by all the models'
    roman = lang if romanize else 'none'
    tok = check_output(
            REM_NON_PRINT_CHAR
            + '|' + NORM_PUNC + lang
            + '|' + DESCAPE
            + '|' + MOSES_TOKENIZER + lang,
            input=line,
            encoding='UTF-8',
            shell=True)
    return tok.strip()

where I have

MOSES_BDIR = os.path.join( TOOL_BASE_DIR, 'moses-tokenizer/tokenizer/')
MOSES_TOKENIZER = MOSES_BDIR + 'tokenizer.perl -q -no-escape -threads 20 -l '
NORM_PUNC = MOSES_BDIR + 'normalize-punctuation.perl -l '
DESCAPE = MOSES_BDIR + 'deescape-special-chars.perl'
REM_NON_PRINT_CHAR = MOSES_BDIR + 'remove-non-printing-char.perl'

So I run normalize-punctuation.perl, remove-non-printing-char.perl and tokenizer.perl. Are all these scripts supported by Sacremoses?

In my understanding I should do like

from sacremoses import MosesTokenizer

mtok = MosesTokenizer(lang='fr')
tokenized_docs = [mtok.tokenize(line) for line in text.splitlines()]

while from the command line options I can see --xml-unescape to unescape special characters for XML, which should match the deescape-special-chars.perl script. Regarding the non-printing chars from remove-non-printing-char.perl, my understanding is that they are handled by the API without further options, like

>>> mtok.tokenize("This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf")
['This', ',', 'is', 'a', 'sentence', 'with', 'weird', '»', 'symbols', '…', 'appearing', 'everywhere', '¿']
>>> 

So what I did was take the signature of tokenize

def tokenize(self, text, aggressive_dash_splits=False, return_str=False, escape=True):

to use it like

mtok.tokenize(string, escape=True, return_str=True, aggressive_dash_splits=False)

because I want a string output (return_str=True) and no hyphen splitting (aggressive_dash_splits=False). Regarding the escaping of HTML chars in the perl version, I don't know whether escape=True in the Python version will match it, since here it handles the XML escaping.

Non-Breaking Prefixes are stripped

Just a note: In the original perl script, non-breaking prefixes are not stripped when loading them from the file.

In my opinion stripping non-breaking prefixes is wanted, because there can be some unwanted spaces that are very hard to spot in the nbp-files.

I actually found this issue because in nonbreaking_prefix.it, on line 139, es is followed by a space, causing dots following es to be split up (using the perl implementation, but not using this implementation).

Verbosity

It seems that the command line always prints a progress bar. I suggest one of the following options:

  1. (preferred) only print it if --verbose|-v is specified (silent by default); or
  2. allow suppressing it with --quiet|-q

Thanks for a great tool, BTW

CLI not prompting when nothing is streaming into the command

While this works:

$ sacremoses train-truecase -m big.model -j 4 < big.txt.tok

It's sort of a gotcha when nothing is streaming in and the script just sits there, e.g.

$ sacremoses train-truecase -m big.model -j 4 

Proposal: Show help if nothing streams in

No such file or directory: nonbreaking_prefix.es

2019-07-04 16:06:14 - ERROR - No such file or directory: '/usr/local/lib/python3.6/dist-packages/sacremoses/data/nonbreaking_prefixes/nonbreaking_prefix.es'

Shall I manually install these files?

Truecaser Test dependency on norvig.com/big.txt

norvig.com is currently down which is causing the tests in sacremoses/test/test_truecaser.py to fail if big.txt has not already been downloaded. I'm wondering if there is another source of the file that might be more reliable or whether it could be included in the repository (e.g. with Git LFS).

Unable to create MosesTokenizer object in Python 2.7

I am unable to create MosesTokenizer or MosesDetokenizer objects in Python 2.7.

>>> from sacremoses import MosesTokenizer        
>>> mt = MosesTokenizer()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sidhi/.local/lib/python2.7/site-packages/sacremoses/tokenize.py", line 209, in __init__
    super(MosesTokenizer, self).__init__()
TypeError: super() argument 1 must be type, not classobj

Is this module supported for Python 2.7?

Escape option on command line

I know that when using sacremoses tokenize from Python I can turn escaping off, but how can I do it through the command line?
Also, why is escaping the default option?

PyPI tarball contains code with syntax errors

The latest release on PyPI (0.0.38) contains code with syntax errors. In the sacremoses/sent_tokenize.py file, I see:

...
        for i, token in enumerate(iter(tokens[:-1])):                           
            if re.search(IS_EOS, token)     

During installation, this leads to the following warning message:

running install_scripts
Installing sacremoses script to /Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.3-apple/py-sacremoses-0.0.38-jrvwvizhy453s2pooq6udlw3b7my5v4e/bin
  File "Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.3-apple/py-sacremoses-0.0.38-jrvwvizhy453s2pooq6udlw3b7my5v4e/lib/python3.7/site-packages/sacremoses/sent_tokenize.py", line 69
    if re.search(IS_EOS, token)
                              ^
SyntaxError: invalid syntax

I noticed that this file doesn't even exist in the GitHub repo. I also noticed that the GitHub repo does not have this 0.0.38 release. What's going on here?

Bug for opening brackets

Opening brackets are not detokenized correctly.

The following testcase fails:

tokenizer = MosesTokenizer()
detokenizer = MosesDetokenizer()

text = "By the mid 1990s a version of the game became a Latvian television series (with a parliamentary setting, and played by Latvian celebrities)."

assert detokenizer.detokenize(tokenizer.tokenize(text)) == text

Update sacremoses.util.CJKChars and is_cjk

Currently sacremoses.util.is_cjk treats Japanese kanas as CJK characters, which I suppose should be excluded.

Maybe it is better to use https://en.wikipedia.org/wiki/Unicode_block as the reference instead of https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane (given in the docstring) and enumerate the Unicode code points of all characters under the "Hangul" and "Han" scripts.

And I'm not sure whether Tibetan characters and scripts like Nushu (女书) should be treated as CJK characters (I'm not an expert in Unicode).

Escaped characters not converted back into special characters after tokenization

Hi, I'm using SacreMoses 0.0.7. Special characters like [ ] < > are escaped when the text is tokenized (using MosesTokenizer.tokenize) and are left escaped in the results (the examples in the doc show the same behavior).

Is this the expected behavior, and is there any reason to do this? (For example, I would expect &apos;s to be converted back to 's in the results.)

To add, MosesTokenizer.penn_tokenize converts square brackets to -LSB- and -RSB-.

>>> import sacremoses
>>> text = 'English is a West Germanic language that was first spoken in early medieval England and eventually became a global lingua franca.[4][5]'
>>> moses_tokenizer = sacremoses.MosesTokenizer(lang = 'en')
>>> moses_tokenizer.tokenize(text)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', '&#91;', '4', '&#93;', '&#91;', '5', '&#93;']
>>> moses_tokenizer.penn_tokenize(text)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', '-LSB-', '4', '-RSB-', '-LSB-', '5', '-RSB-']

first call to MosesTokenizer.tokenize is very slow

As in, it takes several minutes. It seems to happen independent of the specified lang.

In [1]: from sacremoses import MosesTokenizer

In [2]: mt = MosesTokenizer(lang='ko')

In [3]: %time mt.tokenize("세계 에서 가장 강력한")
CPU times: user 3min 3s, sys: 1.75 s, total: 3min 5s
Wall time: 3min 11s
Out[3]: ['세계', '에서', '가장', '강력한']

Subsequent calls perform as expected:

In [4]: %time mt.tokenize("세계 에서 가장 강력한")
CPU times: user 819 µs, sys: 0 ns, total: 819 µs
Wall time: 823 µs
Out[4]: ['세계', '에서', '가장', '강력한']

Latest version of sacremoses (0.0.22). Is this a problem for anyone else?

Initial capitals never regenerated (truecaser)

Do sentences have to be delimited in some way? I have trained the truecaser with a 3,032,679-word tokenized text in Spanish (1 sentence per line). It generates a model with 102,972 entries (is it a unigram-based truecaser?). Then I use it to truecase a similar text in Spanish which has been fully lowercased. The case of many proper nouns, etc. is correctly recovered, but nothing happens at the beginning of sentences. Am I doing something wrong? Thanks a million!

Truecaser Save to File

Saving a previously trained model as shown in the readme does not work.

mtr = MosesTruecaser()
mtr.train('big.txt')  # should be mtr.train_from_file('big.txt')?
mtr._save_model('big.truecasemodel')

Bug in final apostrophe!!

Bug in final apostrophe from original Moses!!

Original Moses:

$ cat in.txt 
dip dye hand-tufted ivory / navy area rug, 8' x 10'
azzura hill hand-tufted ivory indoor/outdoor area rug, 7'6" x 9'6"
caterine hand-tufted ivory area rug, 9' x 12'
de         cor hand-tufted ivory/navy area rug, 8' x 10'

$ cat in.txt | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en 
dip dye hand-tufted ivory / navy area rug , 8 &apos; x 10&apos;
azzura hill hand-tufted ivory indoor / outdoor area rug , 7 &apos; 6 &quot; x 9 &apos; 6 &quot;
caterine hand-tufted ivory area rug , 9 &apos; x 12&apos;
de cor hand-tufted ivory / navy area rug , 8 &apos; x 10&apos;

SacreMoses output:

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer()

>>> x = """dip dye hand-tufted ivory / navy area rug, 8' x 10'
... azzura hill hand-tufted ivory indoor/outdoor area rug, 7'6" x 9'6"
... caterine hand-tufted ivory area rug, 9' x 12'
... de         cor hand-tufted ivory/navy area rug, 8' x 10'
... """

>>> for sent in x.split('\n'):
...    print(mt.tokenize(sent.strip(), return_str=True))
... 
dip dye hand-tufted ivory / navy area rug , 8 &apos; x 10&apos;
azzura hill hand-tufted ivory indoor / outdoor area rug , 7 &apos; 6 &quot; x 9 &apos; 6 &quot;
caterine hand-tufted ivory area rug , 9 &apos; x 12&apos;
de cor hand-tufted ivory / navy area rug , 8 &apos; x 10&apos;

Generic apostrophe is not correctly tokenized

The apostrophe is not correctly tokenized for languages other than English, French, and Italian.

Problem is line 171 in tokenize.py

NON_SPECIFIC_APOSTROPHE = r"\'", r" \' "

The second apostrophe must not be escaped, because the backslash in the substitution is read as a literal character and not as an escape character. The first backslash is not necessary.

NON_SPECIFIC_APOSTROPHE = r"'", " ' "

should work.
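
The difference is visible directly with re.sub: in a replacement string an unknown escape such as \' is left alone, so the backslash leaks into the output:

>>> import re
>>> print(re.sub(r"\'", r" \' ", "l'eau"))  # buggy: escaped replacement
l \' eau
>>> print(re.sub(r"'", r" ' ", "l'eau"))    # fixed
l ' eau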

LICENSE file

It would be nice to have a LICENSE file, so automatic parsers can read it

Thank you

normalize broken

Running 0.0.38 installed via pip:

$ echo "This is a test" | sacremoses normalize
Traceback (most recent call last):
  File "/usr/local/bin/sacremoses", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sacremoses/cli.py", line 297, in normalize_file
    pre_replace_unicode_punct=replace_unicode_punct,
NameError: name 'replace_unicode_punct' is not defined

Tested in two environments (Mac and Linux).

detokenization does not add a space between Chinese/Japanese characters and non-CJK characters

The original Moses perl scripts add a space between tokens that do not end with a CJK character and tokens that do:
https://github.com/moses-smt/mosesdecoder/blob/555829a771cd897bb807f495a95737953a7ca9a3/scripts/tokenizer/detokenizer.perl#L109-L115

The current Python port only adds a space if a token starts with a CJK character and does not end with a CJK character:
https://github.com/alvations/sacremoses/blob/4d994b8781f6c10600d34413679e1a1acdb53cb5/sacremoses/tokenize.py#L692-L696

This seems like a mistake and I would expect the original behavior to be replicated.

detokenizer = MosesDetokenizer()
text = detokenizer.detokenize(['Japan', 'is', '日', '本', 'in', 'Japanese', '.'])
assert text == 'Japan is 日本 in Japanese.'
# it actually will currently return 'Japan is日本 in Japanese.' with no space before 日
