normality's Introduction

normality text cleanup

Normality is a Python micro-package that contains a small set of text normalization functions for easier re-use. These functions accept a snippet of Unicode or UTF-8 encoded text and remove various classes of characters, such as diacritics and punctuation. This is useful as preparation for further text analysis.

WARNING: This library works much better when used in combination with pyicu, a Python binding for the International Components for Unicode C library. ICU provides much better text transliteration than the default text-unidecode.
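
To see which transliteration backend is likely to be used in a given environment, one option is to check whether PyICU can be imported. The sketch below assumes PyICU is installed under the module name `icu`; the helper `has_pyicu` is a hypothetical name for illustration and is not part of normality.

```python
import importlib.util


def has_pyicu() -> bool:
    """Return True if the PyICU module (imported as `icu`) can be found."""
    # find_spec() looks the module up without importing it.
    return importlib.util.find_spec("icu") is not None


if has_pyicu():
    print("PyICU found: ICU transliteration should be available.")
else:
    print("PyICU not found: expect the text-unidecode fallback.")
```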

Example

# coding: utf-8
from normality import normalize, slugify, collapse_spaces

text = normalize('Nie wieder "Grüne Süppchen" kochen!')
assert text == 'nie wieder grune suppchen kochen'

slug = slugify('My first blog post!')
assert slug == 'my-first-blog-post'

text = 'this \n\n\r\nhas\tlots of \nodd spacing.'
assert collapse_spaces(text) == 'this has lots of odd spacing.'

License

normality is open source, licensed under a standard MIT license (included in this repository as LICENSE).

normality's People

Contributors

amdmi3, dependabot-preview[bot], kolanich, pmlandwehr, pombredanne, pudo, rosencrantz, shadchin, sunu

normality's Issues

Tests fail with PyICU==2.8

Copied from: NixOS/nixpkgs#143838 (review)

Regressions in the normality tests; they look legitimate.

https://github.com/NixOS/nixpkgs/pull/143838

6 packages failed to build:
python38Packages.fingerprints python38Packages.normality python38Packages.scancode-toolkit python39Packages.fingerprints python39Packages.normality python39Packages.scancode-toolkit

6 packages built:
gramps python38Packages.PyICU python38Packages.slob python39Packages.PyICU python39Packages.slob xdxf2slob

Upgrading normality to 2.2.3 doesn't help.

__________________________ NormalityTest.test_german ___________________________

self = <tests.test_normality.NormalityTest testMethod=test_german>

    def test_german(self):
        text = u"Häschen Spaß"
>       self.assertEqual("Haschen Spass", ascii_text(text))
E       AssertionError: 'Haschen Spass' != 'Haschen Spa<-ss->'
E       - Haschen Spass
E       + Haschen Spa<-ss->
E       ?            ++  ++

tests/test_normality.py:48: AssertionError
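
For comparison, the result the test expects can be reproduced with the standard library alone. This is a minimal sketch, not normality's implementation and not what ICU does internally: `ascii_fallback` and `MANUAL_MAP` are hypothetical names, and the explicit `ß` to `ss` mapping is an assumption added because `ß` has no Unicode decomposition.

```python
import unicodedata

# Assumption: characters without a Unicode decomposition need an explicit mapping.
MANUAL_MAP = {"ß": "ss"}


def ascii_fallback(text: str) -> str:
    """Fold text to ASCII using only the standard library."""
    for char, replacement in MANUAL_MAP.items():
        text = text.replace(char, replacement)
    # NFKD splits "ä" into "a" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Discard anything that is still outside ASCII.
    return stripped.encode("ascii", "ignore").decode("ascii")


print(ascii_fallback("Häschen Spaß"))  # Haschen Spass
```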

datafreeze dependency issue

Collecting pyicu>=1.9.3 (from normality>=0.5.1->dataset==1.1.0->-r requirements.txt (line 16))
  Downloading https://files.pythonhosted.org/packages/c2/15/0af20b540c828943b6ffea5677c86e908dcac108813b522adebb75c827c1/PyICU-2.2.tar.gz (211kB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-nisBrM/pyicu/setup.py", line 53, in <module>
        ''')
    RuntimeError:
    Please set the ICU_VERSION environment variable to the version of
    ICU you have installed.

Running pip install with dataset==1.1.0 in the python:2.7-alpine Docker image produces this error.
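
The error message asks for the `ICU_VERSION` environment variable to be set. One possible way to find the value is to query `pkg-config`, sketched below. This assumes the ICU development files and `pkg-config` are installed and that ICU is registered under the module name `icu-i18n`; `detect_icu_version` is a hypothetical helper, and the sketch is written for Python 3 rather than the Python 2.7 image from the report.

```python
import shutil
import subprocess
from typing import Optional


def detect_icu_version() -> Optional[str]:
    """Ask pkg-config for the installed ICU version, or return None."""
    if shutil.which("pkg-config") is None:
        return None  # pkg-config itself is not installed
    result = subprocess.run(
        ["pkg-config", "--modversion", "icu-i18n"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # ICU development files were not found
    return result.stdout.strip() or None


version = detect_icu_version()
if version is None:
    print("ICU not found via pkg-config")
else:
    # Value to export as ICU_VERSION before running pip install.
    print("ICU_VERSION=" + version)
```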

Strange behavior of normalize() with \t character

I'm observing strange behavior with this example:

from normality import normalize

text = normalize('this \n\n\r\nhas\tlots of \nodd spacing.')
print(text)
# 'this haslots of odd spacing'

Notice how "has" and "lots" are joined into a single word.
Is this the expected behavior?

BTW, thank you for amazing work on the package!
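
As a possible workaround while this is open, whitespace can be collapsed with a regular expression before the text is passed on, so that tabs become spaces instead of disappearing. This is a sketch using only the standard library, not normality's own `collapse_spaces`; the name `collapse_whitespace` is hypothetical, and whether this changes the output of `normalize()` has not been verified here.

```python
import re

# \s matches spaces, tabs, newlines and carriage returns.
WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return WHITESPACE_RUN.sub(" ", text).strip()


text = "this \n\n\r\nhas\tlots of \nodd spacing."
print(collapse_whitespace(text))  # this has lots of odd spacing.
```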

Test failures in 2.5.0 with Python 3.12

test_guess_encoding, test_petro_iso_encoded, test_predict_encoding are failing in 2.5.0 with Python 3.12:

============================= test session starts ==============================
platform linux -- Python 3.12.0, pytest-7.4.2, pluggy-1.3.0
rootdir: /builddir/build/BUILD/normality-2.5.0
collected 19 items
tests/test_normality.py .....F..F.F....                                  [ 78%]
tests/test_paths.py ..                                                   [ 89%]
tests/test_scripts.py ..                                                 [100%]
=================================== FAILURES ===================================
______________________ NormalityTest.test_guess_encoding _______________________
self = <tests.test_normality.NormalityTest testMethod=test_guess_encoding>
    def test_guess_encoding(self):
        text = u"Порошенко Петро Олексійович"
        encoded = text.encode("iso-8859-5")
        out = guess_encoding(encoded)
>       self.assertEqual("iso8859-5", out)
E       AssertionError: 'iso8859-5' != 'cp1006'
E       - iso8859-5
E       + cp1006
tests/test_normality.py:72: AssertionError
_____________________ NormalityTest.test_petro_iso_encoded _____________________
self = <tests.test_normality.NormalityTest testMethod=test_petro_iso_encoded>
    def test_petro_iso_encoded(self):
        text = u"Порошенко Петро Олексійович"
        encoded = text.encode("iso8859-5")
        out = stringify(encoded)
>       self.assertEqual(text, out)
E       AssertionError: 'Порошенко Петро Олексійович' != 'ﺟﻐﻓﻐﻟﻁﻏﻌﻐ ﺟﻁﻗﻓﻐ ﺝﻍﻁﻌﻕﺉﻋﻐﺻﻊﻝ'
E       - Порошенко Петро Олексійович
E       + ﺟﻐﻓﻐﻟﻁﻏﻌﻐ ﺟﻁﻗﻓﻐ ﺝﻍﻁﻌﻕﺉﻋﻐﺻﻊﻝ
tests/test_normality.py:94: AssertionError
_____________________ NormalityTest.test_predict_encoding ______________________
self = <tests.test_normality.NormalityTest testMethod=test_predict_encoding>
    def test_predict_encoding(self):
        text = u"Порошенко Петро Олексійович"
        encoded = text.encode("iso-8859-5")
        out = predict_encoding(encoded)
>       self.assertEqual("iso8859-5", out)
E       AssertionError: 'iso8859-5' != 'cp1006'
E       - iso8859-5
E       + cp1006
tests/test_normality.py:78: AssertionError
=============================== warnings summary ===============================
tests/test_normality.py::NormalityTest::test_guess_encoding
  /builddir/build/BUILD/normality-2.5.0/normality/encoding.py:76: DeprecationWarning: guess_encoding is now deprecated. Use predict_encoding instead
    warnings.warn(
tests/test_normality.py::NormalityTest::test_guess_file_encoding
  /builddir/build/BUILD/normality-2.5.0/normality/encoding.py:95: DeprecationWarning: guess_encoding is now deprecated. Use predict_encoding instead
    warnings.warn(
tests/test_normality.py::NormalityTest::test_guess_file_encoding
  /builddir/build/BUILD/normality-2.5.0/normality/encoding.py:41: DeprecationWarning: normalize_result is now deprecated. Use tidy_result instead
    warnings.warn(
tests/test_normality.py::NormalityTest::test_guess_file_encoding
  /builddir/build/BUILD/normality-2.5.0/normality/encoding.py:16: DeprecationWarning: normalize_encoding is now deprecated. Use tidy_encoding instead
    warnings.warn(
tests/test_normality.py::NormalityTest::test_stringify_datetime
  /builddir/build/BUILD/normality-2.5.0/tests/test_normality.py:64: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    dt = datetime.utcnow()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/test_normality.py::NormalityTest::test_guess_encoding - Assertio...
FAILED tests/test_normality.py::NormalityTest::test_petro_iso_encoded - Asser...
FAILED tests/test_normality.py::NormalityTest::test_predict_encoding - Assert...
=================== 3 failed, 16 passed, 5 warnings in 0.21s ===================
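
For reference, the values the failing tests expect can be checked without any detection step. The sketch below uses only the standard library and does not call normality; it shows that `iso8859-5` is Python's canonical name for the codec and that decoding with it restores the original string.

```python
import codecs

text = "Порошенко Петро Олексійович"
encoded = text.encode("iso-8859-5")

# codecs.lookup() resolves aliases to Python's canonical codec name.
canonical = codecs.lookup("iso-8859-5").name
print(canonical)  # iso8859-5

# Decoding with the correct codec round-trips the text.
decoded = encoded.decode(canonical)
print(decoded == text)  # True
```

Since the round trip succeeds, the `cp1006` result in the log looks like a wrong guess by the detection step rather than a problem with the codec.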
