tsroten / zhon
Constants used in Chinese text processing
License: MIT License
The zhon.pinyin constants should not match numbers and punctuation; they should only match valid pinyin syllables. The user can then add punctuation/whitespace constants when compiling if necessary.
Some of the zhon constants are memory intensive (e.g. CC-CEDICT constants). zhon should not automatically import its modules.
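One way to keep imports cheap while still exposing the submodules is lazy loading. This is a sketch based on the stdlib's documented lazy-import recipe, with json standing in for a memory-heavy submodule such as zhon.cedict:

```python
import importlib.util
import sys

def lazy_import(name):
    """Return a module whose code only runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # defers execution until an attribute is read
    return module

# 'json' stands in for a heavy submodule like zhon.cedict.
cedict = lazy_import("json")
print(cedict.dumps({"lazy": True}))
```

Nothing is parsed or built until the first attribute is touched, so a plain import of the package stays effectively free.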
Found a great place to use zhon's symbol lists. Parsing regular expressions out of UNIHAN.
https://github.com/cihai/unihan-etl
https://github.com/cihai/unihan-etl/blob/335441a/unihan_etl/expansion.py
Thanks for the project
Zhuyin syllables fail to convert when they carry the first tone (which has no tone mark):
from dragonmapper import hanzi, transcriptions
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄓㄨˋ'))  # works
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄔㄨ'))  # does not work
I traced things down to the following:
def _parse_zhuyin_syllable(unparsed_syllable):
    """Return the syllable and tone of a Zhuyin syllable."""
    zhuyin_tone = unparsed_syllable[-1]
    if zhuyin_tone in zhon.zhuyin.characters:
        syllable, tone = unparsed_syllable, '1'
    elif zhuyin_tone in zhon.zhuyin.marks:
        for tone_number, tone_mark in _ZHUYIN_TONES.items():
            if zhuyin_tone == tone_mark:
                syllable, tone = unparsed_syllable[:-1], tone_number
    else:
        raise ValueError("Invalid syllable: %s" % unparsed_syllable)
    return syllable, tone
For some reason, there is no ㄨ in zhon.zhuyin.characters (and no ㄩ either).
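For reference, a stand-in (not zhon's actual constant): building the inventory from the full 37-letter Bopomofo run U+3105 through U+3129 would include both missing letters:

```python
# The standard Bopomofo letters run from U+3105 (ㄅ) through U+3129 (ㄩ).
characters = "".join(chr(cp) for cp in range(0x3105, 0x312A))
print("\u3128" in characters, "\u3129" in characters)  # ㄨ and ㄩ
```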
zhon.hanzi.characters doesn't currently include U+3007 (〇). It's not a CJK Unified Ideograph, but it's present in 《现代汉语词典》 and CC-CEDICT. zhon.hanzi.characters already includes some characters that aren't unified, and the documentation doesn't claim that only unified characters are allowed; instead, it's described as containing "pertinent CJK ideograph Unicode blocks". I think it's worth extending it to include U+3007.
>>> re.findall(zhon.pinyin.numbered_syllable, 'foo bar', re.IGNORECASE)
['fo', 'o', 'ba', 'r']
>>> re.findall(zhon.pinyin.syllable, 'foo bar', re.IGNORECASE)
['fo', 'o', 'ba', 'r']
The same happens with accented_syllable.
I was hoping for a way to match only pinyin within mixed English text.
Pinyin is not simply A-Z, a-z, 1-5. It has a certain structure to it. SYL#SYL#... There should be a constant for this.
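A sketch of what such a constant could look like. The syllable inventory below is a deliberately simplified stand-in for zhon's real pinyin data; the point is anchoring matches to whole whitespace-delimited tokens, so English words are rejected instead of being chopped into pseudo-syllables:

```python
import re

# Toy numbered-pinyin syllable: optional initial, a final, and a tone digit.
INITIAL = r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?"
FINAL = (r"(?:iang|iong|uang|ang|eng|ong|ai|ao|an|ei|en|er|ia|iao|ian|ie|in|"
         r"ing|iu|ou|ua|uai|uan|ui|un|uo|ve|[aeiouv])")
SYLLABLE = INITIAL + FINAL + r"[1-5]"

# SYL#SYL#...: one or more syllables spanning an entire token.
WORD = re.compile(r"(?<!\S)(?:%s)+(?!\S)" % SYLLABLE, re.IGNORECASE)

print(WORD.findall("ni3hao3 foo bar zhong1guo2"))
```

Because 'foo' and 'bar' cannot be covered end-to-end by tone-numbered syllables, they produce no match at all rather than fragments like 'fo' and 'o'.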
When referring to the character \U00020000, backtick marks should be put around it so that the backslash is not hidden.
AttributeError: module 'zhon' has no attribute 'hanzi'
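This usually just means only "import zhon" was executed. Since zhon deliberately avoids importing its submodules (they can be memory-heavy), zhon.hanzi must be imported explicitly. The stdlib email package behaves the same way with its mime subpackage:

```python
import importlib

import email  # like 'import zhon', this does not import subpackages

bound_before = hasattr(email, "mime")   # False: not bound automatically
importlib.import_module("email.mime")   # like 'import zhon.hanzi'
bound_after = hasattr(email, "mime")    # True after the explicit import
print(bound_before, bound_after)
```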
In the description of Zhon, RE pattern object should link like this: RE pattern object.
When I run re.findall('[%s]' % zhon.hanzi.characters, 'I broke a plate: 我打破了一个盘子.'), I get an empty list []. Why? I'm using Python 2.7.3 on Ubuntu 12.04.
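A likely cause on Python 2 (an assumption, not confirmed): the search string is a byte string, so the Unicode ranges in zhon.hanzi.characters can never match; decoding it to unicode (u'...') fixes it. On Python 3, where str is always Unicode, the same match works; the character class here is a stand-in for zhon.hanzi.characters:

```python
import re

# Stand-in for zhon.hanzi.characters: the basic CJK Unified Ideographs block.
HANZI = "\u4e00-\u9fff"

text = "I broke a plate: 我打破了一个盘子."
print(re.findall("[%s]" % HANZI, text))
```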
There is a typo in zhon.pinyin that makes the r-suffix (as used in erhua-style Chinese) combine with the previous syllable in the regular expression. Based on phonetics alone, combining them into one syllable makes sense; however, because the suffix is represented by an additional character, it is unwise to treat it as one syllable. Doing so creates problems when using zhon.pinyin to interact with Chinese characters, which is likely a common scenario for users. For example, huar is parsed as one syllable, while in hua1r5 the r-suffix is ignored altogether.
The link to Zhon's GitHub issues page is not formatted correctly.
In the docs, sys.maxunicode should be marked up as an inline literal so it reads as code.
The build_string function needs logging.
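A hedged sketch (build_string's real signature in zhon may differ): assuming it joins code-point ranges into one string, debug logging could look like this:

```python
import logging

logger = logging.getLogger(__name__)

def build_string(ranges):
    """Build a string from inclusive (start, end) code-point ranges,
    logging each range as it is added. Hypothetical stand-in for zhon's helper."""
    logger.debug("Building string from %d range(s)", len(ranges))
    parts = []
    for start, end in ranges:
        logger.debug("Adding U+%04X through U+%04X", start, end)
        parts.append("".join(chr(cp) for cp in range(start, end + 1)))
    return "".join(parts)

print(build_string([(0x61, 0x63), (0x4E00, 0x4E01)]))
```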
Under the section zhon.pinyin.RE_NUMBER, "expression" is spelled wrong.
These DeprecationWarnings are raised under pytest as of zhon 1.1.5, on Python 3.10:
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40: DeprecationWarning: invalid escape sequence '\]'
non_stops = """"#$%&'()*+,-/:;<=>@[\]^_`{|}~"""
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153: DeprecationWarning: invalid escape sequence '\]'
"""[%(stops)s]['"\]\}\)]*"""
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154: DeprecationWarning: invalid escape sequence '\-'
) % {'word': word, 'non_stops': non_stops.replace('-', '\-'),
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
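A likely fix (my assumption about the intended change; the string contents stay identical): make the affected literals raw strings, so '\]' and '\-' are plain characters instead of invalid escape sequences:

```python
# Raw-string version of the literal from pinyin.py line 40.
non_stops = r""""#$%&'()*+,-/:;<=>@[\]^_`{|}~"""

# Raw-string replacement for the escaping on line 154.
escaped = non_stops.replace("-", r"\-")

print("\\" in non_stops, r"\-" in escaped)
```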
zhon.pinyin.vowels includes the Latin alpha (ɑ) that is sometimes used instead of a normal a. The regular expressions should support it as well.
If the string separates e.g. the ǎ in xiǎo into 'a\u030c' rather than as one codepoint '\u01ce', which is rendered identically, the regex fails. These separate diacritics occur if you use Unicode normalization NFKD.
I suppose one solution is to duplicate for each way to represent ǎ (ditto for the others), perhaps programmatically generate the two options. Or maybe this just needs a note in the docs?
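The programmatic-duplication idea can also be side-stepped by renormalizing input before matching; a quick demonstration of the two representations and the NFC round-trip:

```python
import re
import unicodedata

precomposed = "xi\u01ceo"                                # ǎ as one code point
decomposed = unicodedata.normalize("NFKD", precomposed)  # a + U+030C caron

# The single-code-point character class misses the decomposed form...
print(re.findall("[\u01ce]", decomposed))       # no match
# ...but normalizing back to NFC restores it.
renormalized = unicodedata.normalize("NFC", decomposed)
print(re.findall("[\u01ce]", renormalized))
```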
There are some Pinyin code points missing that need to be accounted for.
Currently, zhon.cedict.TRADITIONAL and zhon.cedict.SIMPLIFIED are strings consisting of each character that occurs in CC-CEDICT. It would be better if they contained a few duplicate characters but were ordered so that the same index maps between the two constants:
>>> zhon.cedict.TRADITIONAL[5689]
'你'
>>> zhon.cedict.SIMPLIFIED[5689]
'你'
>>> zhon.cedict.TRADITIONAL[7899]
'國'
>>> zhon.cedict.SIMPLIFIED[7899]
'国'
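With that parallel ordering in place, character-level conversion becomes a one-liner; a sketch with toy stand-ins for the real constants:

```python
# Toy stand-ins for zhon.cedict.TRADITIONAL / SIMPLIFIED, index-aligned.
TRADITIONAL = "你國愛"
SIMPLIFIED = "你国爱"

trad_to_simp = str.maketrans(TRADITIONAL, SIMPLIFIED)
print("我愛中國".translate(trad_to_simp))  # characters not in the table pass through
```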
Tests are still needed for:
Using the zhon.pinyin package, the pinyin (UTF-8) syllable 'beì' does not match fully in a regular expression. In fact, 'beì' gets truncated to 'e', unlike all other pinyin tried so far (~1,000).
Here's an example from Terminal in Mac OS X 10.11.6 and Python 2.7:
>>> import re
>>> import zhon.pinyin
>>> ln = u'南 ná 無 mó 善 shàn 臂 beì 菩薩 pú sà'
>>> pyo = re.findall(zhon.pinyin.syllable, ln)
>>> pyo
[u'n\xe1', u'm\xf3', u'sh\xe0n', u'e', u'p\xfa', u's\xe0']
^ missing 'b' and 'ì'
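A plausible explanation (my assumption; not confirmed by the maintainer): in 'beì' the tone mark sits on the i, while standard pinyin orthography writes 'bèi' with the mark on the e. The two strings look alike but are different code-point sequences, so a syllable pattern built from standard orthography matches only the bare 'e' inside the non-standard spelling:

```python
reported = "be\u00ec"   # beì: tone mark on the i (non-standard placement)
standard = "b\u00e8i"   # bèi: tone mark on the e (standard placement)

print(reported == standard)
print([f"U+{ord(c):04X}" for c in reported])
print([f"U+{ord(c):04X}" for c in standard])
```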
In the README, Zhon is described as a "module"; "package" would be a better description.
zhon.pinyin.word and its related constants should not include numbers (expressing quantity, not tone) in the regular expression pattern. While pinyin sentences might contain numbers, individual words should not.
In order to give users an idea of how to use Zhon's constants, some examples are needed.
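For instance, the docs could show splitting Chinese text into sentences; the constants below are simplified stand-ins for zhon.hanzi.characters and the sentence-final stops:

```python
import re

CHARACTERS = "\u4e00-\u9fff"   # ~ zhon.hanzi.characters (basic CJK block)
STOPS = "\u3002\uff01\uff1f"   # 。！？ ~ sentence-final stops

text = "我买了一个新手机。你呢？"
print(re.findall("[%s]+" % CHARACTERS, text))   # runs of hanzi
sentences = [s for s in re.split("[%s]" % STOPS, text) if s]
print(sentences)
```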
These symbols are missing from zhon.hanzi's punctuation list: 《 · 〈 〉 ﹑ ﹔