suminb / hanja

Hangul and Hanja library

Python 100.00%

python hanja hangul nlp

hanja's Introduction

hanja: a Hanja-to-Hangul conversion library

This module is used by the Hanja-Hangul converter.

Improve Hanja Library

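If you find a missing Hanja character or an incorrect reading while using the library, please report it through this link. It will be reviewed and incorporated. You are also welcome to send a PR directly on GitHub.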

Installation

pip install hanja

Usage

Import the required modules

>>> import hanja
>>> from hanja import hangul

Separate a Hangul syllable into its initial, medial, and final components

>>> hangul.separate('가')
(0, 0, 0)
>>> hangul.separate('까')
(1, 0, 0)

If the last element of the tuple is 0, the character has no final consonant.

>>> hangul.separate('한')
(18, 0, 4)

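This shows that 'ㅎ' is the 19th initial consonant (index 18), 'ㅏ' is the first vowel (index 0), and 'ㄴ' is the fifth entry (index 4) in the final-consonant table, where index 0 means no final consonant.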

Compose a single syllable from initial, medial, and final components

>>> hangul.build(0, 0, 0)
'가'
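
The indices above follow from how Unicode lays out precomposed Hangul syllables: 19 initials, 21 medials, and 28 finals (including "no final"), starting at U+AC00. The following is a minimal sketch of that arithmetic, not necessarily how hangul.separate and hangul.build are implemented; the names separate_sketch and build_sketch are hypothetical.

```python
HANGUL_BASE = 0xAC00   # code point of '가'
INITIAL_COUNT = 19     # initial consonants
MEDIAL_COUNT = 21      # medial vowels
FINAL_COUNT = 28       # 27 final consonants plus "no final" at index 0


def separate_sketch(ch):
    """Return (initial, medial, final) indices of a precomposed Hangul syllable."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < INITIAL_COUNT * MEDIAL_COUNT * FINAL_COUNT:
        raise ValueError("not a precomposed Hangul syllable: %r" % ch)
    initial, rest = divmod(offset, MEDIAL_COUNT * FINAL_COUNT)
    medial, final = divmod(rest, FINAL_COUNT)
    return initial, medial, final


def build_sketch(initial, medial, final):
    """Compose a syllable from its three indices (inverse of separate_sketch)."""
    return chr(HANGUL_BASE + (initial * MEDIAL_COUNT + medial) * FINAL_COUNT + final)
```

Round-tripping a syllable through both functions returns the original character, which is a quick sanity check for any implementation of this scheme.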

Check whether a given character is Hangul

>>> hangul.is_hangul('가')
True
>>> hangul.is_hangul('a')
False

Split text into Hangul parts and Hanja parts

Returns a generator rather than a list.

>>> '|'.join(hanja.split_hanja('大韓民國은 民主共和國이다.'))
'大韓民國|은 |民主共和國|이다.'

>>> [x for x in hanja.split_hanja('大韓民國은 民主共和國이다.')]
['大韓民國', '은 ', '民主共和國', '이다.']
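
One way to picture this kind of splitting is to group consecutive characters by whether they are Hanja. The sketch below is an illustration under assumptions, not the library's implementation: is_cjk_sketch and split_by_script are hypothetical names, and only the basic CJK Unified Ideographs block is treated as Hanja.

```python
from itertools import groupby


def is_cjk_sketch(ch):
    # Assumption: only U+4E00..U+9FFF (CJK Unified Ideographs) counts as Hanja.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_by_script(text):
    """Yield maximal runs of consecutive Hanja or non-Hanja characters."""
    for _, run in groupby(text, key=is_cjk_sketch):
        yield ''.join(run)
```

Because the function is a generator, wrap it in list() or ''.join() when the whole result is needed at once.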

Check whether a given character is Hanja

>>> hanja.is_hanja('韓')
True

>>> hanja.is_hanja('한')
False

Sentence translation

Substitution mode:

>>> hanja.translate('大韓民國은 民主共和國이다.', 'substitution')
'대한민국은 민주공화국이다.'

Combination mode (text):

>>> hanja.translate('大韓民國은 民主共和國이다.', 'combination-text')
'大韓民國(대한민국)은 民主共和國(민주공화국)이다.'

Combination mode, version 2 (text):

>>> hanja.translate('大韓民國은 民主共和國이다.', 'combination-text-reversed')
'대한민국(大韓民國)은 민주공화국(民主共和國)이다.'

Combination mode (HTML):

>>> hanja.translate(u'大韓民國은 民主共和國이다.', 'combination-html')
'<span class="hanja">大韓民國</span><span class="hangul">(대한민국)</span>은 <span class="hanja">民主共和國</span><span class="hangul">(민주공화국)</span>이다.'

hanja's Issues

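Initial sound law (두음법칙) errors

Hello,

龍潭 is converted to 龍담 instead of 용담.
麗川 is converted to 麗천 instead of 여천.

This looks like a problem with the initial sound law.
I would appreciate it if you could look into it.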
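
For context, the expected readings 용담 and 여천 come from applying the initial sound law to a word-initial syllable (룡 becomes 용, 려 becomes 여). The sketch below shows a simplified form of that rule. It is an assumption-laden illustration, not the library's dooeum function: apply_initial_sound_law is a hypothetical name, and exceptions such as 렬/률 after a vowel are not handled.

```python
# Medial indices of ㅑ, ㅕ, ㅖ, ㅛ, ㅠ, ㅣ in Unicode order.
Y_OR_I_MEDIALS = {2, 6, 7, 12, 17, 20}
NIEUN, RIEUL, IEUNG = 2, 5, 11  # initial indices of ㄴ, ㄹ, ㅇ


def apply_initial_sound_law(syllable):
    """Rewrite a word-initial ㄹ/ㄴ syllable (simplified rule)."""
    offset = ord(syllable) - 0xAC00
    if not 0 <= offset < 11172:   # not a precomposed Hangul syllable
        return syllable
    initial, rest = divmod(offset, 21 * 28)
    medial, final = divmod(rest, 28)
    if initial == RIEUL:
        # ㄹ becomes ㅇ before y/i vowels, otherwise ㄴ.
        initial = IEUNG if medial in Y_OR_I_MEDIALS else NIEUN
    elif initial == NIEUN and medial in Y_OR_I_MEDIALS:
        # ㄴ becomes ㅇ before y/i vowels.
        initial = IEUNG
    return chr(0xAC00 + (initial * 21 + medial) * 28 + final)
```

The rule should only be applied to the first syllable of a word, so the caller has to decide where word boundaries are.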

Combination mode translation (혼용 모드 변환)

The current combination translation mode (혼용 모드 변환) is designed for a particular web application. Consider an interface that accepts a function for generating custom output.
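
A function-based interface of this kind could look roughly like the sketch below. Everything here is hypothetical: translate_with is not part of the library, and the input is assumed to be a pre-split list of (text, reading) pairs where reading is None for non-Hanja segments.

```python
def translate_with(segments, render):
    """Join segments, passing each Hanja segment and its reading to `render`.

    `segments` is a list of (text, reading) pairs; `reading` is None for
    segments that need no conversion.
    """
    out = []
    for original, reading in segments:
        out.append(original if reading is None else render(original, reading))
    return ''.join(out)


def ruby(hanja_text, hangul_text):
    # Example renderer producing HTML ruby annotations.
    return '<ruby>%s<rt>%s</rt></ruby>' % (hanja_text, hangul_text)
```

With this shape, callers choose the output format by passing a different renderer, for example translate_with(segments, ruby).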

Consolidate mode, format_string parameters

A translation mode can be specified when calling translate().

hanja.translate('大韓民國은 民主共和國이다.', 'substitution')
hanja.translate('大韓民國은 民主共和國이다.', 'combination-text')

A custom output format can also be provided by supplying the format_string parameter.

hanja.translate('大韓民國은 民主共和國이다.', 'combination-text', format_string='{hanja} {hangul}')

In such cases, the mode parameter serves no purpose.

I would like to revise translate() so that it takes only a format_string parameter, with pre-defined format strings provided for the existing translation modes (substitution, combination-text, combination-html). It would look like this:

hanja.translate('大韓民國은 民主共和國이다.', Mode.substitution)
hanja.translate('大韓民國은 民主共和國이다.', Mode.combination_text)
hanja.translate('大韓民國은 民主共和國이다.', '{hanja} <{hangul}>')
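
The proposal could be sketched as an enum whose members carry format strings. This is only an illustration of the idea under assumptions: the Mode values below are inferred from the outputs shown in the README, and render_word is a hypothetical helper, not an existing function.

```python
from enum import Enum


class Mode(Enum):
    # Pre-defined format strings for the existing translation modes.
    substitution = '{hangul}'
    combination_text = '{hanja}({hangul})'
    combination_html = (
        '<span class="hanja">{hanja}</span>'
        '<span class="hangul">({hangul})</span>'
    )


def render_word(hanja_word, hangul_word, format_string):
    """Format one converted word with a Mode member or a custom format string."""
    if isinstance(format_string, Mode):
        format_string = format_string.value
    return format_string.format(hanja=hanja_word, hangul=hangul_word)
```

A single parameter then covers both the built-in modes and custom formats, which is the consolidation the issue asks for.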

Being aware of some hanjas' phonetic changes

Some Hanja, such as 金/讀/畵, can be pronounced in different ways. The current behavior can produce incorrect results in some cases, e.g.:

Input: 金日成綜合大學은 平壤에 있는 朝鮮民主主義人民共和國의 國立大學이다.
Expected output: 김일성종합대학은 평양에 있는 조선민주주의인민공화국의 국립대학이다.
Actual output: 금일성종합대학은 평양에 있는 조선민주주의인민공화국의 국립대학이다.

Hanja	Word 1	Word 2
金	金剛經 (금강경)	金浦國際空港 (김포국제공항)
讀	讀書 (독서)	句讀點 (구두점)
畵	畵龍點睛 (화룡점정)	企畵 (기획)
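
One possible approach is to consult a word-level table before falling back to per-character readings, preferring the longest match. The sketch below illustrates that idea only; the tables are tiny illustrative stand-ins, read_hanja is a hypothetical name, and none of this reflects how the library currently works.

```python
# Illustrative stand-ins; a real table would be far larger.
WORD_READINGS = {'金浦': '김포', '句讀點': '구두점'}
CHAR_READINGS = {'金': '금', '剛': '강', '經': '경', '讀': '독', '書': '서'}


def read_hanja(text, words=WORD_READINGS, chars=CHAR_READINGS):
    """Convert text, preferring the longest word-level match at each position."""
    max_len = max(map(len, words), default=0)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in words:
                out.append(words[chunk])
                i += size
                break
        else:
            # No word matched here: fall back to the single-character reading,
            # leaving unknown characters unchanged.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return ''.join(out)
```

Longest-match lookup does not resolve every ambiguity, but it lets known words such as 金浦 override the default reading of 金.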

Combination Modes

We currently have a combination mode in which the Hangul reading is added in parentheses after the original Hanja text. Each class of characters is wrapped in a different <span> tag to differentiate semantics.

>>> hanja.translate(u'大韓民國은 民主共和國이다.', 'combination')
<span class="hanja">大韓民國</span><span class="hangul">(대한민국)</span>은 <span class="hanja">民主共和國</span><span class="hangul">(민주공화국)</span>이다.

However, I thought it would be useful to provide a text-only combination mode, assuming not everyone uses this library to produce HTML.

>>> hanja.translate(u'大韓民國은 民主共和國이다.', 'combination-html')
<span class="hanja">大韓民國</span><span class="hangul">(대한민국)</span>은 <span class="hanja">民主共和國</span><span class="hangul">(민주공화국)</span>이다.

>>> hanja.translate(u'大韓民國은 民主共和國이다.', 'combination-text')
大韓民國(대한민국)은 民主共和國(민주공화국)이다.

Backward compatibility may be preserved by making the legacy combination mode fall back to combination-html.
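
Such a fallback can be as small as an alias table consulted before dispatching on the mode name. A minimal sketch, assuming hypothetical names LEGACY_MODE_ALIASES and resolve_mode:

```python
# Legacy mode names mapped to their current equivalents.
LEGACY_MODE_ALIASES = {'combination': 'combination-html'}


def resolve_mode(mode):
    """Return the current name for a mode, mapping legacy names forward."""
    return LEGACY_MODE_ALIASES.get(mode, mode)
```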

Unrecognized Hanja

How can the library be made to handle Hanja such as 𤍠 (U+24360) and 𨽾 (U+28F7E)?
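
Both code points lie in CJK Unified Ideographs Extension B (U+20000 to U+2A6DF), outside the basic block. A range check that covers the extension blocks might look like the sketch below; is_cjk_ideograph is a hypothetical name, the list of ranges is deliberately incomplete, and recognizing a character does not by itself provide a reading for it.

```python
CJK_RANGES = (
    (0x3400, 0x4DBF),     # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),     # CJK Unified Ideographs
    (0xF900, 0xFAFF),     # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),   # CJK Unified Ideographs Extension B
)


def is_cjk_ideograph(ch):
    """Return True if the character falls in one of the listed CJK blocks."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJK_RANGES)
```

A reading for each newly recognized character would still have to be added to the conversion table before it could be translated.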

Documentation

Do some documentation!!

The hanja module no longer works properly after the update to 0.14.0.

Files such as impl.py appear to have been left out when the package was uploaded to PyPI.

>>> hanja.split_hanja("대한민국은 한자로 표현하면 大韓民國이다.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/hanja/__init__.py", line 38, in load_and_call
    mod = __import__(import_path)
ModuleNotFoundError: No module named 'hanja.impl'

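Some Hanja are not converted

Hello,

While using the hanja library, I found a Hanja character that is not converted.

input_text = '女fjdks南減朴a로롤로롤로 '
hanja.translate(input_text, 'substitution')  # Hanja -> Hangul substitution
# Output: '女fjdks남감박a로롤로롤로 '

As shown above, the character 女 (계집 녀) is not converted.

The data is the news corpus from 모두의 말뭉치, and the development environment is ubuntu 18.03, python 3.8.3.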

hanja-0.13.1, No such file or directory: 'requirements.txt'

C:\temp>py -m pip install hanja
Collecting hanja
  Using cached hanja-0.13.1.tar.gz (119 kB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\Programs\Python3864\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\usrme\\AppData\\Local\\Temp\\pip-install-2zu2r50b\\hanja\\setup.py'"'"'; __file__='"'"'C:\\Users\\usrme\\AppData\\Local\\Temp\\pip-install-2zu2r50b\\hanja\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\usrme\AppData\Local\Temp\pip-install-2zu2r50b\hanja\pip-egg-info'
         cwd: C:\Users\usrme\AppData\Local\Temp\pip-install-2zu2r50b\hanja\
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\usrme\AppData\Local\Temp\pip-install-2zu2r50b\hanja\setup.py", line 17, in <module>
        with open("requirements.txt") as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Installing 0.13.1 with pip fails with the error above because requirements.txt is missing.

The file is not included in hanja-0.13.1.tar.gz.
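
The usual fix is to ship the file in the source distribution (for example with an `include requirements.txt` line in MANIFEST.in). As a defensive measure, setup.py could also tolerate a missing file. The helper below is a hypothetical sketch of that second idea, not the project's actual setup.py.

```python
import os


def read_requirements(path='requirements.txt'):
    """Return requirement lines from `path`, or an empty list if it is missing."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        lines = (line.strip() for line in f)
        # Skip blank lines and comments.
        return [line for line in lines if line and not line.startswith('#')]
```

Returning an empty list silently drops dependencies when the file is absent, so this only avoids the crash; declaring dependencies directly in install_requires avoids the problem altogether.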

License

I am building an extension/app that does something similar to this library. I am interested in using table.yml, which compiles Hanja readings very thoroughly, and I would like to suggest adding a license that states whether I and many others can legitimately do so.
But, of course, "[y]ou're under no obligation to choose a license."

Two versions of the same Chinese character

Hi. It seems that the same Chinese character can have two versions, which look slightly different and have different Unicode code points. Only one version is recognized as hanja.

For example, 李 has two versions, U+674E and U+F9E1. Only the first version passes as hanja.

My guess, from looking at 李, 金, 宅, is that all code points in U+F900-U+FA60 in the Unicode tables (http://www.tamasoft.co.jp/en/general-info/unicode.html) suffer from the same problem.

Would it be possible to have code points U+F900-U+FA60 recognized as hanja?

Thank you!
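
A possible workaround on the caller's side is Unicode normalization: most characters in the CJK Compatibility Ideographs block have a canonical decomposition to a unified ideograph, so NFC normalization maps U+F9E1 to U+674E. The sketch below uses only the standard library; to_unified is a hypothetical helper name, and a few code points in that block have no decomposition and are left unchanged.

```python
import unicodedata


def to_unified(text):
    """Map compatibility ideographs to their unified forms via NFC normalization."""
    return unicodedata.normalize('NFC', text)
```

Normalizing input with to_unified before conversion would let the existing table handle both variants, at the cost of losing the distinction between them.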

you need PyYAML in install_requires in setup.py for pypi package

Please release a new version on PyPI

The version currently on PyPI raises the following error.

  File "translit.py", line 36, in tra
    input = hanja.translate(input, 'substitution')
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 46, in translate
    split_hanja(text)))
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 45, in <lambda>
    return ''.join(map(lambda w: translate_word(w, mode),
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 54, in translate_word
    tw = ''.join(map(translate_syllable, u' '+word[:-1], word))
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 14, in translate_syllable
    return dooeum(previous, hanja_table[current])
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hangul.py", line 30, in dooeum
    p, c = Hangul.separate(previous), Hangul.separate(current)
NameError: global name 'Hangul' is not defined

I tried to test it, but it throws an error.

ImportError: cannot import name 'hangul' from partially initialized module 'hanja' (most likely due to a circular import

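Readings of Hanja with two or more pronunciations

Hello, and thank you for making such a great package. I use it a lot.

I mainly use it when reading Chinese books in the original,
attaching a reading guide below the source text to improve readability and reading speed.
(Since I started doing this, my reading speed has doubled... thank you!)

When a Hanja has two or more pronunciations, it is often read incorrectly.
I think this is a fundamental limitation of Hanja as a system rather than a problem with the package.
For example, reading '适合' with Hanja produces '괄합', but '적합' is the correct reading.

The Naver Hanja dictionary lists two readings, 빠를 괄 and 적합할 적,
but in most cases the first reading is the one that is output.
This happens often enough to be a little disappointing.

Personally, I have squeezed what I can out of my two months of Python experience
to build and use a user dictionary.
Adding an entry such as 适:적 overrides the package's internal dictionary and takes precedence.
But my skills are limited, so the code is messy and inefficient...
It would be great if the package itself provided a user dictionary feature.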
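
A user dictionary that takes precedence over a built-in table can be expressed with collections.ChainMap, which looks up keys in the first mapping before the second. This is a sketch of the idea only: BUILTIN_READINGS is a two-entry stand-in for the package's table, and make_reader is a hypothetical name.

```python
from collections import ChainMap

# Stand-in for the package's internal reading table.
BUILTIN_READINGS = {'适': '괄', '合': '합'}


def make_reader(user_readings):
    """Return a function that converts text, preferring user-supplied readings."""
    table = ChainMap(user_readings, BUILTIN_READINGS)

    def read(text):
        # Characters without a reading are passed through unchanged.
        return ''.join(table.get(ch, ch) for ch in text)

    return read
```

The built-in table is never modified, so different callers can layer different overrides on top of the same data.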

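Another issue is the initial sound law (두음법칙).
Chinese has no spaces between words, so when Hanja is applied to a Chinese document,
the initial sound law is applied only at the very beginning of a sentence and nowhere else.

For example, 里面 (이면), as in 이면세계, always comes out as 리면.
If there were a way to correct cases like this as well
(for example, support for a pronunciation dictionary that takes priority over the initial sound law function),
I think the package would become much more useful.

Thank you once again for sharing such a good module.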

Please, support python 3.x version.

It does not work properly in Python 3.5.
Does it only work on Python 2.x?
Please support Python 3.x.