sic's Issues

Transitivity in tokenization config

When the tokenization config indicates that "A" must be replaced with "B" and "B" must be replaced with "C", this should be interpreted as an instruction to replace "A" with "C".
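
One way to read the requested behavior is as a pre-processing step that collapses replacement chains before the tokenizer is built. The sketch below is not sic's implementation. It assumes the rules are available as a plain dict mapping each token to its replacement, and the helper name `resolve_transitive` is hypothetical.

```python
def resolve_transitive(replacements):
    """Collapse chains such as A -> B, B -> C into A -> C, B -> C.

    `replacements` is assumed to be a dict of token -> replacement.
    A cycle (A -> B, B -> A) has no final target, so it raises ValueError.
    """
    resolved = {}
    for source in replacements:
        target = replacements[source]
        seen = {source}
        # Follow the chain until the target is no longer a key itself.
        while target in replacements:
            if target in seen:
                raise ValueError(f"cyclic replacement involving {target!r}")
            seen.add(target)
            target = replacements[target]
        resolved[source] = target
    return resolved


rules = {"A": "B", "B": "C"}
print(resolve_transitive(rules))  # {'A': 'C', 'B': 'C'}
```

Raising on cycles is a design choice of this sketch; the issue does not say how cyclic rules should be handled.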

Implement "partial normalization"

Implement an option that does not introduce new spaces between identified tokens in the output and that leaves the character case as in the original string wherever no replacement has been made, even if the configuration is case-insensitive.

E.g., when the config case-insensitively requests that the token bad be replaced with good, the result should be:
123-Bad-456-Badass --> 123-good-456-Badass (not 123 - good - 456 - badass)

Use case: limiting sic functionality to solely spelling correction.
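
The expected output above can be illustrated with a small standalone sketch. It does not use sic at all; it assumes that a token is a maximal run of ASCII letters, and the function name `correct_spelling` is hypothetical.

```python
import re


def correct_spelling(text, replacements):
    """Replace whole alphabetic tokens case-insensitively.

    Everything that is not replaced keeps its original case, and no
    separators are inserted. Assumption: a token is a maximal run of
    ASCII letters, so "Bad" inside "Badass" is not a match.
    """
    result = text
    for wrong, right in replacements.items():
        pattern = re.compile(
            r"(?<![A-Za-z])" + re.escape(wrong) + r"(?![A-Za-z])",
            re.IGNORECASE,
        )
        # A function replacement avoids backslash processing in `right`.
        result = pattern.sub(lambda match, right=right: right, result)
    return result


print(correct_spelling("123-Bad-456-Badass", {"bad": "good"}))
# 123-good-456-Badass
```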

Type conversion when compiling in Python 3.9

Compiler warnings when cythonizing under Python 3.9

sic/core.c(9004): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(11597): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(12034): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(12676): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(13003): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(13736): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(17110): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
sic/core.c(17200): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'

Questionable behavior when ReplaceToken instruction has characters processed by SplitToken instruction

Because SplitToken instructions are executed before ReplaceToken instructions, the following use case produces a result that does not seem right:

>>> import sic
>>> x = sic.Model()
>>> x.add_rule(sic.ReplaceToken('(P)', '[P]'))
>>> sic.build_normalizer(x)
>>>
>>> # The following is correct, as groups of brackets and alphabetical characters are getting split unconditionally
>>> sic.normalize('ab(c)')
'ab ( c )'
>>>
>>> # The following is incorrect, the right solution would rather seem like 'ab [p]'
>>> sic.normalize('ab(p)')
'ab ( p [p]'

The effect occurs regardless of whether the SplitToken instruction is unconditional or explicitly set for the Model() class.
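
To make the expected outcome concrete, here is a standalone sketch that applies literal replacements before splitting, so a replaced span is never split again. It is not how sic works internally. The splitting rule (runs of letters, runs of digits, single other characters) and both function names are assumptions made for illustration, and replacements are matched as literal substrings rather than at token boundaries.

```python
import re


def split_tokens(text):
    # Assumed splitting rule: runs of letters, runs of digits,
    # and every other non-space character on its own.
    return re.findall(r"[a-z]+|[0-9]+|[^a-z0-9\s]", text)


def normalize_replace_first(text, replacements):
    """Apply literal replacements first, then split only what is left."""
    text = text.lower()
    lowered = {k.lower(): v.lower() for k, v in replacements.items() if k}
    if not lowered:
        return " ".join(split_tokens(text))
    # Longest keys first, so overlapping rules prefer the longer match.
    keys = sorted(lowered, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    pieces, pos = [], 0
    for match in pattern.finditer(text):
        pieces.extend(split_tokens(text[pos:match.start()]))
        pieces.append(lowered[match.group(0)])  # kept whole, not re-split
        pos = match.end()
    pieces.extend(split_tokens(text[pos:]))
    return " ".join(pieces)


rules = {"(P)": "[P]"}
print(normalize_replace_first("ab(c)", rules))  # ab ( c )
print(normalize_replace_first("ab(p)", rules))  # ab [p]
```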

Joining substrings

Add another class of normalization instruction that allows joining substrings together.
Test case: a rule that shrinks non- into non and does not introduce a word separator:
non compliant --> non compliant
noncompliant --> noncompliant
non-compliant --> noncompliant
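
The three test cases can be reproduced with a short regex-based sketch. This only illustrates the requested behavior and is not a proposed sic class; the function name `join_prefix` and its parameters are hypothetical.

```python
import re


def join_prefix(text, prefix="non", joiner="-"):
    """Drop `joiner` when it directly connects `prefix` to a following word.

    Only the exact sequence prefix + joiner + word character is rewritten,
    so "non compliant" and "noncompliant" pass through unchanged.
    """
    pattern = re.compile(
        r"\b" + re.escape(prefix) + re.escape(joiner) + r"(?=\w)"
    )
    return pattern.sub(lambda match: prefix, text)


for sample in ("non compliant", "noncompliant", "non-compliant"):
    print(sample, "-->", join_prefix(sample))
```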

Implement ad hoc model building

There must be a way to build a tokenization model directly in Python code, for example:

import sic
builder = sic.Builder()
model = builder.create_model()
model.add_rule(...)
machine = builder.build_normalizer(model)

Implement ad hoc normalization functionality

A single normalization task should be runnable like this:

import sic

x = sic.normalize('string-to-normalize')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config', word_separator=' ')
x = sic.normalize('string-to-normalize', tokenizer_config='path/to/config', word_separator=' ', normalizer_option=0)
# and so on
