pgolo / sic

Utility for string normalization

License: MIT License

Shell 2.55% Batchfile 2.67% Python 90.67% Cython 4.11%

natural-language-processing nlp rule-based-nlp string-normalization text-normalization tokenization

sic's Issues

Transitivity in tokenization config

When the tokenization config indicates that "A" must be replaced with "B" and "B" must be replaced with "C", this should be interpreted as an instruction to replace "A" with "C".
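
The requested behavior can be sketched in plain Python, independently of sic's internals. This is only an illustration under the assumption that replacement rules can be viewed as a simple source-to-target mapping; `resolve_chains` is a hypothetical helper, not part of the sic API.

```python
def resolve_chains(rules):
    """Collapse replacement chains so every source maps to its final target."""
    resolved = {}
    for source in rules:
        seen = {source}
        target = rules[source]
        # Follow the chain until the target is no longer itself replaced.
        while target in rules:
            if target in seen:
                raise ValueError(f"circular replacement involving {target!r}")
            seen.add(target)
            target = rules[target]
        resolved[source] = target
    return resolved


rules = {"A": "B", "B": "C"}
print(resolve_chains(rules))  # {'A': 'C', 'B': 'C'}
```

A circular configuration such as A --> B, B --> A has no final target, so the sketch raises an error rather than looping forever.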

There must be a way to instantly add a tokenization rule to a compiled model

Implement "partial normalization"

Implement an option that does not introduce new spaces between identified tokens in the output and that preserves the original character case wherever no replacement was made, even if the configuration is case-insensitive.

For example, when the config requests a case-insensitive replacement of the token bad with good, the result should be:
123-Bad-456-Badass --> 123-good-456-Badass (not 123 - good - 456 - badass)

Use case: limiting sic to spelling correction only.
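
A minimal sketch of the expected output, written with the standard `re` module rather than sic itself. It assumes that a token is a maximal run of ASCII letters, which is a simplification of sic's tokenization; `partial_replace` is a hypothetical name.

```python
import re


def partial_replace(text, replacements):
    """Replace whole alphabetic tokens only, leaving everything else untouched."""
    lookup = {key.lower(): value for key, value in replacements.items()}

    def substitute(match):
        token = match.group(0)
        # Tokens without a matching rule keep their original spelling and case.
        return lookup.get(token.lower(), token)

    # Assumption: a token is a maximal run of ASCII letters.
    return re.sub(r"[A-Za-z]+", substitute, text)


print(partial_replace("123-Bad-456-Badass", {"bad": "good"}))
# 123-good-456-Badass
```

Because the substitution works in place on the original string, no separators are added and untouched tokens such as Badass keep their case.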

Build wheel for Python 3.9

Type conversion when compiling in Python 3.9

Compiler warnings when cythonizing under Python 3.9

sic/core.c(9004): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(11597): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(12034): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(12676): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(13003): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(13736): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(17110): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(17200): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'

Questionable behavior when ReplaceToken instruction has characters processed by SplitToken instruction

Because SplitToken instructions are executed before ReplaceToken instructions, the following use case produces a result that does not seem right:

>>> import sic
>>> x = sic.Model()
>>> x.add_rule(sic.ReplaceToken('(P)', '[P]'))
>>> sic.build_normalizer(x)
>>>
>>> # The following is correct, as groups of brackets and alphabetical characters are getting split unconditionally
>>> sic.normalize('ab(c)')
'ab ( c )'
>>>
>>> # The following is incorrect; the expected result would be 'ab [p]'
>>> sic.normalize('ab(p)')
'ab ( p [p]'

The effect occurs regardless of whether the SplitToken instruction is unconditional or explicitly set for the Model() class.

Joining substrings

Add another class of normalization instructions that allows joining substrings together.
Test case: a rule that shrinks non- into non without introducing a word separator:
non compliant --> non compliant
noncompliant --> noncompliant
non-compliant --> noncompliant
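
The three test cases above can be reproduced with a small regular-expression sketch. It is not how sic would implement the instruction; `join_after` is a hypothetical helper, and it assumes the fragment ends with a hyphen that should be dropped when a word character follows.

```python
import re


def join_after(text, fragment):
    """Drop the trailing hyphen of `fragment` so it fuses with the next word."""
    stem = fragment.rstrip("-")
    # Match the fragment only at a word start and only when a word follows it.
    pattern = r"\b" + re.escape(fragment) + r"(?=\w)"
    return re.sub(pattern, lambda match: stem, text)


for sample in ("non compliant", "noncompliant", "non-compliant"):
    print(sample, "-->", join_after(sample, "non-"))
```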

Implement implicit instantiation of Builder and Normalizer classes

So that something like this would work:

import sic
sic.build_normalizer('path/to/config.xml')
result = sic.normalize('string to normalize')

Implement ad hoc model building

There must be a way to build a tokenization model directly in Python code, for example:

import sic
builder = sic.Builder()
model = builder.create_model()
model.add_rule(...)
machine = builder.build_normalizer(model)

Builder must be able to pickle/unpickle the sdata structure to save/load precompiled normalization units
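
A rough sketch of the save/load round trip using the standard `pickle` module. The layout of the sdata structure is not described in this issue, so the dictionary below is a purely hypothetical stand-in, as are the function names and the file name.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a compiled structure; the real layout is not shown here.
compiled = {"replacements": {"bad": "good"}, "case_sensitive": False}


def save_compiled(data, path):
    """Serialize a compiled structure to disk."""
    with open(path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_compiled(path):
    """Load a previously saved structure; only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "normalizer.pkl")
    save_compiled(compiled, path)
    restored = load_compiled(path)

print(restored == compiled)  # True
```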

Implement ad hoc normalization functionality

A single normalization task should be runnable like this:

import sic

x = sic.normalize('string-to-normalize')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config', word_separator=' ')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config', word_separator=' ', normalizer_option=0)
# and so on

Conflicting instructions are not resolved with transitive inference

When there is a set of replace instructions that conflict with each other but eventually converge:

A --> B
A --> C
B --> D
C --> D

sic does not process them. The correct behavior would be to infer the following:

A --> D
B --> D
C --> D
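
One possible way to express this inference, sketched outside of sic: follow every branch from each source and accept the rule set only if all branches end at the same token. `infer_targets` is a hypothetical helper, and treating the rules as a list of (source, target) pairs is an assumption made for illustration.

```python
def infer_targets(pairs):
    """Map each source to the single terminal token that all of its chains reach."""
    graph = {}
    for source, target in pairs:
        graph.setdefault(source, set()).add(target)

    def terminals(token, path):
        if token in path:
            raise ValueError(f"circular replacement involving {token!r}")
        if token not in graph:
            return {token}  # nothing replaces this token any further
        found = set()
        for nxt in graph[token]:
            found |= terminals(nxt, path | {token})
        return found

    resolved = {}
    for source in graph:
        ends = terminals(source, frozenset())
        if len(ends) != 1:
            # The branches never converge, so the conflict is genuine.
            raise ValueError(f"{source!r} has conflicting targets: {sorted(ends)}")
        resolved[source] = ends.pop()
    return resolved


pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(infer_targets(pairs))  # {'A': 'D', 'B': 'D', 'C': 'D'}
```

A set such as A --> B, A --> C with no further rules would still be rejected, since B and C are different terminal tokens.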

Spelling correction is not working as expected

The unit test for spelling correction does not pass.
