pgolo / sic
Utility for string normalization
License: MIT License
When the tokenization config indicates that "A" must be replaced with "B" and "B" must be replaced with "C", this should be interpreted as an instruction to replace "A" with "C".
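The requested behavior can be illustrated with a minimal sketch in plain Python. It does not use or describe sic internals; the function name resolve_chains and the dict representation of replacement rules are assumptions made for illustration only.

```python
def resolve_chains(rules):
    """Collapse chained replacements so every source maps to its final target.

    `rules` is a hypothetical dict of {source: target}; it is not a sic data structure.
    """
    resolved = {}
    for source in rules:
        seen = {source}
        target = rules[source]
        # Follow the chain until reaching a token that has no further replacement.
        while target in rules:
            if target in seen:
                raise ValueError(f"circular replacement involving {target!r}")
            seen.add(target)
            target = rules[target]
        resolved[source] = target
    return resolved


print(resolve_chains({"A": "B", "B": "C"}))  # {'A': 'C', 'B': 'C'}
```

Resolving the chains once, before normalization starts, would keep the per-string work unchanged. Circular rules are rejected here because they have no final target.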
Implement an option not to introduce new spaces between identified tokens in the output, and to leave character case as in the original string where no replacements have been made, even if the configuration is case-insensitive.
E.g., when the config case-insensitively requests that token bad be replaced with good, the result should be:
123-Bad-456-Badass --> 123-good-456-Badass
(not 123 - good - 456 - badass)
Use case: limiting sic functionality to solely spelling correction.
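As a rough illustration of the expected output, the sketch below uses Python's re module rather than sic itself. The helper name replace_token_in_place is hypothetical, and treating regex word boundaries as token boundaries is an assumption that happens to fit the example above.

```python
import re


def replace_token_in_place(text, token, replacement):
    """Replace whole-word occurrences of `token`, ignoring case, and leave
    every other character (separators, case of untouched text) as it was."""
    pattern = re.compile(r"\b" + re.escape(token) + r"\b", re.IGNORECASE)
    # A function is used as the replacement so that backslashes in
    # `replacement` are never interpreted as regex escapes.
    return pattern.sub(lambda match: replacement, text)


print(replace_token_in_place("123-Bad-456-Badass", "bad", "good"))
# 123-good-456-Badass
```

Only the matched token changes; "Badass" keeps its case because "Bad" is not a whole word there.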
Compiler warnings when cythonizing under Python 3.9
sic/core.c(9004): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(11597): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(12034): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(12676): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(13003): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(13736): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(17110): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(17200): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
Because SplitToken instructions are executed before ReplaceToken instructions, the following use case produces a result that does not seem right:
>>> import sic
>>> x = sic.Model()
>>> x.add_rule(sic.ReplaceToken('(P)', '[P]'))
>>> sic.build_normalizer(x)
>>>
>>> # The following is correct, as groups of brackets and alphabetical characters are getting split unconditionally
>>> sic.normalize('ab(c)')
'ab ( c )'
>>>
>>> # The following is incorrect, the right solution would rather seem like 'ab [p]'
>>> sic.normalize('ab(p)')
'ab ( p [p]'
The effect occurs regardless of whether the SplitToken instruction is unconditional or explicitly set for the Model() class.
Another class for normalization instructions that allows joining substrings together.
Test case: a rule that shrinks non- into non and does not introduce a word separator:
non compliant --> non compliant
noncompliant --> noncompliant
non-compliant --> noncompliant
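The test case above can be checked against a small stand-alone sketch. It is written with Python's re module and is not a proposal for how the new instruction class should be implemented in sic; the function name join_prefix and its parameters are hypothetical.

```python
import re


def join_prefix(text, prefix="non", joiner="-"):
    """Drop `joiner` when it directly links `prefix` to the following word,
    without inserting any separator in its place."""
    pattern = re.compile(
        r"\b" + re.escape(prefix) + re.escape(joiner) + r"(?=\w)"
    )
    return pattern.sub(lambda match: prefix, text)


for sample in ("non compliant", "noncompliant", "non-compliant"):
    print(sample, "-->", join_prefix(sample))
```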
So that something like this would work:
import sic
sic.build_normalizer('path/to/config.xml')
result = sic.normalize('string to normalize')
There should be a way to build a tokenization model directly in Python code, for example:
import sic
builder = sic.Builder()
model = builder.create_model()
model.add_rule(...)
machine = builder.build_normalizer(model)
A single normalization task should be runnable like this:
import sic
x = sic.normalize('string-to-normalize')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config', word_separator=' ')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config', word_separator=' ', normalizer_option=0)
# and so on
When there is a set of replace instructions that conflict with each other but eventually converge:
A --> B
A --> C
B --> D
C --> D
sic does not process them. Correct behavior would be to infer the following:
A --> D
B --> D
C --> D
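One way to express the inference is sketched below in plain Python. It is independent of sic's actual rule handling; the function name resolve_converging and the list-of-pairs input format are assumptions chosen for illustration. A source is accepted only if all of its replacement paths end at the same token.

```python
from collections import defaultdict


def resolve_converging(pairs):
    """Map each source to its single final target, or raise if the
    replacement paths end at different tokens or form a cycle."""
    targets = defaultdict(set)
    for source, target in pairs:
        targets[source].add(target)

    def terminals(token, path):
        # Collect every token at which a replacement path starting here ends.
        if token in path:
            raise ValueError(f"circular replacement involving {token!r}")
        if token not in targets:
            return {token}
        found = set()
        for following in targets[token]:
            found |= terminals(following, path | {token})
        return found

    resolved = {}
    for source in list(targets):
        ends = terminals(source, frozenset())
        if len(ends) != 1:
            raise ValueError(f"{source!r} does not converge: {sorted(ends)}")
        resolved[source] = next(iter(ends))
    return resolved


rules = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(resolve_converging(rules))  # {'A': 'D', 'B': 'D', 'C': 'D'}
```

Rule sets that truly conflict, such as A --> B together with A --> C and no common end point, still raise an error in this sketch.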
Unit test on spelling correction is not passing