tsroten / zhon
Constants used in Chinese text processing
License: MIT License
The zhon.pinyin constants should not match numbers and punctuation; they should only match valid pinyin syllables. The user can then add punctuation/whitespace constants when compiling if necessary.
Some of the zhon constants are memory intensive (e.g. CC-CEDICT constants). zhon should not automatically import its modules.
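One way to keep imports cheap while still exposing the submodules is lazy loading. This is a sketch based on the stdlib's documented lazy-import recipe, with json standing in for a memory-heavy submodule such as zhon.cedict:

```python
import importlib.util
import sys

def lazy_import(name):
    """Return a module whose code only runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # defers execution until an attribute is read
    return module

# 'json' stands in for a heavy submodule like zhon.cedict.
cedict = lazy_import("json")
print(cedict.dumps({"lazy": True}))
```

Nothing is parsed or built until the first attribute is touched, so a plain import of the package stays effectively free.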
Found a great place to use zhon's symbol lists. Parsing regular expressions out of UNIHAN.
https://github.com/cihai/unihan-etl
https://github.com/cihai/unihan-etl/blob/335441a/unihan_etl/expansion.py
Thanks for the project
Zhuyin syllables fail to convert when they carry the first tone (which has no tone mark):
from dragonmapper import hanzi, transcriptions
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄓㄨˋ'))  # works
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄔㄨ'))  # does not work
I traced things down to the following:
def _parse_zhuyin_syllable(unparsed_syllable):
    """Return the syllable and tone of a Zhuyin syllable."""
    zhuyin_tone = unparsed_syllable[-1]
    if zhuyin_tone in zhon.zhuyin.characters:
        syllable, tone = unparsed_syllable, '1'
    elif zhuyin_tone in zhon.zhuyin.marks:
        for tone_number, tone_mark in _ZHUYIN_TONES.items():
            if zhuyin_tone == tone_mark:
                syllable, tone = unparsed_syllable[:-1], tone_number
    else:
        raise ValueError("Invalid syllable: %s" % unparsed_syllable)
    return syllable, tone
For some reason, there is no ㄨ in zhon.zhuyin.characters (and no ㄩ either).
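For reference, a stand-in (not zhon's actual constant): building the inventory from the full 37-letter Bopomofo run U+3105 through U+3129 would include both missing letters:

```python
# The standard Bopomofo letters run from U+3105 (ㄅ) through U+3129 (ㄩ).
characters = "".join(chr(cp) for cp in range(0x3105, 0x312A))
print("\u3128" in characters, "\u3129" in characters)  # ㄨ and ㄩ
```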
zhon.hanzi.characters doesn't currently include U+3007 (〇). It's not a CJK Unified Ideograph, but it's present in 《现代汉语词典》 and CC-CEDICT. zhon.hanzi.characters already includes some characters that aren't unified, and the documentation doesn't claim that only unified characters are allowed; instead, it's described as containing "pertinent CJK ideograph Unicode blocks". I think it's worth extending it to include U+3007.
>>> re.findall(zhon.pinyin.numbered_syllable, 'foo bar', re.IGNORECASE)
['fo', 'o', 'ba', 'r']
>>> re.findall(zhon.pinyin.syllable, 'foo bar', re.IGNORECASE)
['fo', 'o', 'ba', 'r']
The same happens with accented_syllable.
I was hoping for a way to match only pinyin within mixed English text.
Pinyin is not simply A-Z, a-z, 1-5. It has a certain structure to it. SYL#SYL#... There should be a constant for this.
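A sketch of what such a constant could look like. The syllable inventory below is a deliberately simplified stand-in for zhon's real pinyin data; the point is anchoring matches to whole whitespace-delimited tokens, so English words are rejected instead of being chopped into pseudo-syllables:

```python
import re

# Toy numbered-pinyin syllable: optional initial, a final, and a tone digit.
INITIAL = r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?"
FINAL = (r"(?:iang|iong|uang|ang|eng|ong|ai|ao|an|ei|en|er|ia|iao|ian|ie|in|"
         r"ing|iu|ou|ua|uai|uan|ui|un|uo|ve|[aeiouv])")
SYLLABLE = INITIAL + FINAL + r"[1-5]"

# SYL#SYL#...: one or more syllables spanning an entire token.
WORD = re.compile(r"(?<!\S)(?:%s)+(?!\S)" % SYLLABLE, re.IGNORECASE)

print(WORD.findall("ni3hao3 foo bar zhong1guo2"))
```

Because 'foo' and 'bar' cannot be covered end-to-end by tone-numbered syllables, they produce no match at all rather than fragments like 'fo' and 'o'.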
When referring to the character \U00020000, backtick marks should be put around it so that the backslash is not hidden.
AttributeError: module 'zhon' has no attribute 'hanzi'
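This usually just means only "import zhon" was executed. Since zhon deliberately avoids importing its submodules (they can be memory-heavy), zhon.hanzi must be imported explicitly. The stdlib email package behaves the same way with its mime subpackage:

```python
import importlib

import email  # like 'import zhon', this does not import subpackages

bound_before = hasattr(email, "mime")   # False: not bound automatically
importlib.import_module("email.mime")   # like 'import zhon.hanzi'
bound_after = hasattr(email, "mime")    # True after the explicit import
print(bound_before, bound_after)
```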
In the description of Zhon, RE pattern object should link like this: RE pattern object.
When I run re.findall('[%s]' % zhon.hanzi.characters, 'I broke a plate: 我打破了一个盘子.'), I get an empty list []. Why? I'm using Python 2.7.3 on Ubuntu 12.04.
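A likely cause on Python 2 (an assumption, not confirmed): the search string is a byte string, so the Unicode ranges in zhon.hanzi.characters can never match; decoding it to unicode (u'...') fixes it. On Python 3, where str is always Unicode, the same match works; the character class here is a stand-in for zhon.hanzi.characters:

```python
import re

# Stand-in for zhon.hanzi.characters: the basic CJK Unified Ideographs block.
HANZI = "\u4e00-\u9fff"

text = "I broke a plate: 我打破了一个盘子."
print(re.findall("[%s]" % HANZI, text))
```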
There is a typo in zhon.pinyin that makes the r-suffix (as used in erhua-style Chinese) combine with the previous syllable in the regular expression. Based on phonetics alone, combining them into one syllable makes sense; however, because the suffix is represented by an additional character, it is unwise to treat it as one syllable. Doing so creates problems when using zhon.pinyin to interact with Chinese characters, which is likely a common scenario for users. For example, huar is parsed as one syllable, while in hua1r5 the r-suffix is ignored altogether.
The link to Zhon's GitHub issues page is not formatted correctly.
In the docs, sys.maxunicode should be marked up as an inline literal so it reads as code.
The build_string function needs logging.
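A hedged sketch (build_string's real signature in zhon may differ): assuming it joins code-point ranges into one string, debug logging could look like this:

```python
import logging

logger = logging.getLogger(__name__)

def build_string(ranges):
    """Build a string from inclusive (start, end) code-point ranges,
    logging each range as it is added. Hypothetical stand-in for zhon's helper."""
    logger.debug("Building string from %d range(s)", len(ranges))
    parts = []
    for start, end in ranges:
        logger.debug("Adding U+%04X through U+%04X", start, end)
        parts.append("".join(chr(cp) for cp in range(start, end + 1)))
    return "".join(parts)

print(build_string([(0x61, 0x63), (0x4E00, 0x4E01)]))
```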
Under the section zhon.pinyin.RE_NUMBER, "expression" is spelled wrong.
These DeprecationWarnings are raised under pytest as of zhon 1.1.5, on Python 3.10:
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40: DeprecationWarning: invalid escape sequence '\]'
non_stops = """"#$%&'()*+,-/:;<=>@[\]^_`{|}~"""
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153: DeprecationWarning: invalid escape sequence '\]'
"""[%(stops)s]['"\]\}\)]*"""
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154
~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154: DeprecationWarning: invalid escape sequence '\-'
) % {'word': word, 'non_stops': non_stops.replace('-', '\-'),
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
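A likely fix (my assumption about the intended change; the string contents stay identical): make the affected literals raw strings, so '\]' and '\-' are plain characters instead of invalid escape sequences:

```python
# Raw-string version of the literal from pinyin.py line 40.
non_stops = r""""#$%&'()*+,-/:;<=>@[\]^_`{|}~"""

# Raw-string replacement for the escaping on line 154.
escaped = non_stops.replace("-", r"\-")

print("\\" in non_stops, r"\-" in escaped)
```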
zhon.pinyin.vowels includes the Latin alpha (ɑ) that is sometimes used instead of a normal a. The regular expressions should support it as well.
If the string separates e.g. the ǎ in xiǎo into 'a\u030c' rather than as one codepoint '\u01ce', which is rendered identically, the regex fails. These separate diacritics occur if you use Unicode normalization NFKD.
I suppose one solution is to duplicate for each way to represent ǎ (ditto for the others), perhaps programmatically generate the two options. Or maybe this just needs a note in the docs?
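The programmatic-duplication idea can also be side-stepped by renormalizing input before matching; a quick demonstration of the two representations and the NFC round-trip:

```python
import re
import unicodedata

precomposed = "xi\u01ceo"                                # ǎ as one code point
decomposed = unicodedata.normalize("NFKD", precomposed)  # a + U+030C caron

# The single-code-point character class misses the decomposed form...
print(re.findall("[\u01ce]", decomposed))       # no match
# ...but normalizing back to NFC restores it.
renormalized = unicodedata.normalize("NFC", decomposed)
print(re.findall("[\u01ce]", renormalized))
```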
There are some Pinyin code points missing that need to be accounted for.
Currently, zhon.cedict.TRADITIONAL and zhon.cedict.SIMPLIFIED are strings consisting of each character that occurs in CC-CEDICT. It would be better if they contained a few duplicate characters but were ordered so that the same index maps between the two constants:
>>> zhon.cedict.TRADITIONAL[5689]
'你'
>>> zhon.cedict.SIMPLIFIED[5689]
'你'
>>> zhon.cedict.TRADITIONAL[7899]
'國'
>>> zhon.cedict.SIMPLIFIED[7899]
'国'
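With that parallel ordering in place, character-level conversion becomes a one-liner; a sketch with toy stand-ins for the real constants:

```python
# Toy stand-ins for zhon.cedict.TRADITIONAL / SIMPLIFIED, index-aligned.
TRADITIONAL = "你國愛"
SIMPLIFIED = "你国爱"

trad_to_simp = str.maketrans(TRADITIONAL, SIMPLIFIED)
print("我愛中國".translate(trad_to_simp))  # characters not in the table pass through
```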
Tests are still needed for:
Using the zhon.pinyin package, the pinyin (UTF-8) syllable 'beì' does not match fully in a regular expression. In fact, 'beì' gets truncated to 'e', unlike all other pinyin tried so far (~1,000).
Here's an example from Terminal in Mac OS X 10.11.6 and Python 2.7:
>>> import re
>>> import zhon.pinyin
>>> ln = u'南 ná 無 mó 善 shàn 臂 beì 菩薩 pú sà'
>>> pyo = re.findall(zhon.pinyin.syllable, ln)
>>> pyo
[u'n\xe1', u'm\xf3', u'sh\xe0n', u'e', u'p\xfa', u's\xe0']
^ missing 'b' and 'ì'
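A plausible explanation (my assumption; not confirmed by the maintainer): in 'beì' the tone mark sits on the i, while standard pinyin orthography writes 'bèi' with the mark on the e. The two strings look alike but are different code-point sequences, so a syllable pattern built from standard orthography matches only the bare 'e' inside the non-standard spelling:

```python
reported = "be\u00ec"   # beì: tone mark on the i (non-standard placement)
standard = "b\u00e8i"   # bèi: tone mark on the e (standard placement)

print(reported == standard)
print([f"U+{ord(c):04X}" for c in reported])
print([f"U+{ord(c):04X}" for c in standard])
```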
In the README, Zhon is described as a "module"; "package" would be a better description.
zhon.pinyin.word and its related constants should not include numbers (expressing quantity, not tone) in the regular expression pattern. While pinyin sentences might contain numbers, individual words should not.
In order to give users an idea of how to use Zhon's constants, some examples are needed.
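For instance, the docs could show splitting Chinese text into sentences; the constants below are simplified stand-ins for zhon.hanzi.characters and the sentence-final stops:

```python
import re

CHARACTERS = "\u4e00-\u9fff"   # ~ zhon.hanzi.characters (basic CJK block)
STOPS = "\u3002\uff01\uff1f"   # 。！？ ~ sentence-final stops

text = "我买了一个新手机。你呢？"
print(re.findall("[%s]+" % CHARACTERS, text))   # runs of hanzi
sentences = [s for s in re.split("[%s]" % STOPS, text) if s]
print(sentences)
```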
These symbols are missing from zhon.hanzi's punctuation list: 《 · 〈 〉 ﹑ ﹔