spoteno

spoteno (Spoken-Text-Normalization) is a tool for cleaning up text transcripts for speech recognition systems. These systems normally expect target transcripts to contain only characters from a restricted set.

Installation

Install the latest development version:

pip install git+https://github.com/ynop/spoteno.git

Examples

The default use case is normalizing a sentence. This forces the output string to contain only valid characters (as defined by the configuration).

import spoteno

sentence = ('Am 11. Januar geht er um 5m nach links,'
            'weshalb er $d schon "ziemlich" müde ist.')

norm = spoteno.Normalizer.de()
outsent = norm.normalize(sentence)
print(outsent)

# >>> am elfte januar geht er um fünf m nach links weshalb er d schon ziemlich müde ist

With force=False, the final cleanup is disabled. Invalid characters may then occur in the output if the configuration does not handle them explicitly.

outsent = norm.normalize(sentence, force=False)
print(outsent)

# >>> am elfte januar geht er um fünf m nach links weshalb er $d schon ziemlich müde ist
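To illustrate what a forced final cleanup amounts to, here is a minimal standalone sketch. It does not use spoteno and is not its implementation. The character set `VALID_CHARS` and the helper `force_cleanup` are hypothetical names chosen for illustration; the real German configuration may allow a different set.

```python
# Hypothetical allowed character set; spoteno's real configuration may differ.
VALID_CHARS = set("abcdefghijklmnopqrstuvwxyzäöüß ")


def force_cleanup(text, valid_chars=VALID_CHARS):
    """Drop every character not in valid_chars, then collapse repeated spaces."""
    kept = "".join(ch for ch in text if ch in valid_chars)
    return " ".join(kept.split())


# The force=False output from above still contains '$'.
unforced = ("am elfte januar geht er um fünf m nach links "
            "weshalb er $d schon ziemlich müde ist")
print(force_cleanup(unforced))

# >>> am elfte januar geht er um fünf m nach links weshalb er d schon ziemlich müde ist
```

Dropping leftover characters keeps the output within the allowed set, but it can hide gaps in the configuration, which is why inspecting the unforced output is useful.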

The debug method additionally returns the set of invalid characters that remain in the final output. This can be used to create or debug a configuration. The outputs of the individual configuration steps can also be printed.

outsent, error = norm.debug(sentence)
print(error)

# >>> START               Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.
# >>> Strip               ['Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> Lower               ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> StripChar           ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotSurroundedByDigits['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotPrecededByDigit['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceRegex        ['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceChar         ['am 11. januar geht er um 5m nach links weshalb er $d schon  ziemlich  müde ist']
# >>> ReplaceChar         ['am 11. januar geht er um 5m nach links weshalb er $d schon  ziemlich  müde ist']
# >>> WhitespaceTokenize  ['am', '11.', 'januar', 'geht', 'er', 'um', '5m', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> SplitNumberSuffix   ['am', '11.', 'januar', 'geht', 'er', 'um', '5', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> NumberToWords       ['am', '11.', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> OrdinalNumberToWords['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceChar         ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceFull         ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> RemoveDiacritics    ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> Strip               ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> END                 ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']

# >>> {'$'}
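When tuning a configuration against many sentences, it can help to count how often each invalid character remains. The following standalone sketch shows one way to do that on lists of output tokens. It does not call spoteno. `VALID_CHARS` and `count_invalid_chars` are hypothetical names, and the second token list is made-up illustration data.

```python
from collections import Counter

# Hypothetical allowed character set; spoteno's real configuration may differ.
VALID_CHARS = set("abcdefghijklmnopqrstuvwxyzäöüß")


def count_invalid_chars(token_lists, valid_chars=VALID_CHARS):
    """Count every character outside valid_chars across all token lists."""
    counts = Counter()
    for tokens in token_lists:
        for token in tokens:
            counts.update(ch for ch in token if ch not in valid_chars)
    return counts


token_lists = [
    ['weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist'],  # from the debug output above
    ['preis', '$', '&', 'co'],                                    # made-up example tokens
]
counts = count_invalid_chars(token_lists)
print(counts.most_common())

# >>> [('$', 2), ('&', 1)]
```

The most frequent characters are usually the ones worth handling explicitly in the configuration first.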

Development

Prerequisites

It's recommended to use a virtual environment when developing spoteno. To create one, execute the following command in the project's root directory:

python -m venv .

To install spoteno and all its dependencies, execute:

pip install -e .

Running the test suite

pip install -e .[dev]
python setup.py test

With PyCharm you might have to change the default test runner. Otherwise, it might only offer nose. To do so, go to File > Settings > Tools > Python Integrated Tools (on the Mac it's PyCharm > Preferences > Settings > Tools > Python Integrated Tools) and change the test runner to py.test.

Versions

Versions are handled using bump2version. To bump the version:

bump2version [major,minor,patch,release,num]

To go directly to a final release version (skipping .dev/.rc/...):

bump2version [major,minor,patch] --new-version x.x.x

Release

Commands to create a new release on PyPI:

rm -rf build
rm -rf dist

python setup.py sdist
python setup.py bdist_wheel
twine upload dist/*
