g2pk's Introduction

g2pK: g2p module for Korean

g2p is the task of converting graphemes to phonemes. Hangul, the main script for Korean, is phonetic, but its pronunciation rules are notoriously complicated, so learning to read Korean text aloud is never easy. That is why g2p is necessary in various NLP tasks such as TTS. There is an open-source g2p library for Korean, KoG2P. It is simple and works well, but I think we need a better one. Please read through the following section (Main features & Usage) to understand the philosophy of g2pK and how to use it. We know it is not perfect at present. That is one of the reasons your contributions are more than welcome.

Requirements

Installation

pip install g2pk

Main features & Usage

  • Returns text as it is pronounced, keeping punctuation.
>>> from g2pk import G2p
>>> g2p = G2p()
>>> g2p("어제는 날씨가 맑았는데, 오늘은 흐리다.")
어제는 날씨가 말간는데, 오느른 흐리다.
  • Determines pronunciation from context, thanks to Mecab, a morphological analyzer. In the following example, note that the first and second 신고 are pronounced differently.
>>> g2p("신을 신고 얼른 동사무소에 가서 혼인 신고 해라")
시늘 신꼬 얼른 동사무소에 가서 호닌 신고 해라
  • Returns two types of results: prescriptive (default) and descriptive (with the option descriptive=True) pronunciation. For example, the particle (josa) 의 is pronounced 의 in principle, but in everyday speech it is often pronounced 에. Likewise, 계 is much more often pronounced 게.
>>> sent = "나의 친구는 계산이 아주 빠르다"
>>> g2p(sent)
나의 친구는 계사니 아주 빠르다
>>> g2p(sent, descriptive=True)
나에 친구는 게사니 아주 빠르다
  • This distinction becomes more obvious if you set group_vowels=True. In contemporary colloquial speech, some vowels are hard to distinguish from each other. In the example below, the vowel ㅒ is normalized to ㅖ.
>>> sent = "저는 예전에 그 얘기를 들은 적이 있습니다"
>>> g2p(sent)
저느 녜저네 그 얘기를 드른 저기 읻씀니다
>>> g2p(sent, group_vowels=True)
저느 녜저네 그 예기를 드른 저기 읻씀니다
  • By default, it returns the standard Korean script, where letters are assembled to form syllables. If you set to_syl=False, however, it returns individual Hangul letters (jamo). This can be useful for applications such as speech synthesis. Note: depending on the font you are using, the two results below may look the same, but they are not.
>>> sent = "어제는 날씨가 맑았는데, 오늘은 흐리다."
>>> g2p(sent)
어제는 날씨가 말간는데, 오느른 흐리다.
>>> g2p(sent, to_syl=False)
어제는 날씨가 말간는데, 오느른 흐리다.
  • English words are converted to Hangul.
>>> sent = "그 사람은 좀, old school 같아"
>>> g2p(sent)
그 사라믄 좀, 올드 스쿨 가타
  • Arabic numerals are spelled out according to their context. Note that the first 12 is pronounced 열두, whereas the second 12 is pronounced 십이.
>>> sent = "지금 시각은 12시 12분입니다"
>>> g2p(sent)
지금 시가그 녈두시 시비부님니다
  • Rules cannot cover every single case, so add special idioms to idioms.txt.
  • If you set verbose=True, you will see each conversion step along with the rule that was applied.
>>> sent = "학교에 갔다 와서, 엄마가 해 주신 밥을 먹었다."
>>> g2p(sent, verbose=True)
학교에 갔다 와서, 엄마가 해 주신 밥을 먹었다. -> 학꾜에 갔다 와서, 엄마가 해 주신 밥을 먹었다.
 제23항 받침 'ㄱ(ㄲ, ㅋ, ㄳ, ㄺ), ㄷ(ㅅ, ㅆ, ㅈ, ㅊ, ㅌ), ㅂ(ㅍ, ㄼ, ㄿ, ㅄ)' 뒤에 연결되는 'ㄱ, ㄷ, ㅂ, ㅅ, ㅈ'은 된소리로 발음한다.
-> 국밥[국빱], 깎다[깍따], 넑받이[넉빠지], 삯돈[삭똔]
-> 닭장[닥짱], 칡범[칙뻠], 뻗대다[뻗때다], 옷고름[옫꼬름]
-> 있던[읻떤], 꽂고[꼳꼬], 꽃다발[꼳따발], 낯설다[낟썰다]
-> 밭갈이[받까리], 솥전[솓쩐], 곱돌[곱똘], 덮개[덥깨]
-> 옆집[엽찝], 넓죽하다[넙쭈카다], 읊조리다[읍쪼리다], 값지다[갑찌다] 
학꾜에 갔다 와서, 엄마가 해 주신 밥을 먹었다. -> 학꾜에 갇따 와서, 엄마가 해 주신 밥을 먹얻따.
 제9항 받침 'ㄲ, ㅋ', 'ㅅ, ㅆ, ㅈ, ㅊ, ㅌ', 'ㅍ'은 어말 또는 자음 앞에서 각각 대표음 [ㄱ, ㄷ, ㅂ]으로 발음한다.
-> 닦다[닥따], 키읔[키윽], 키읔과[키윽꽈], 옷[옫]
-> 웃다[욷따], 있다[읻따], 젖[젇], 빚다[빋따]
-> 꽃[꼳], 쫓다[쫃따], 솥[솓], 뱉다[밷따]
-> 앞[압], 덮다[덥따]
제23항 받침 'ㄱ(ㄲ, ㅋ, ㄳ, ㄺ), ㄷ(ㅅ, ㅆ, ㅈ, ㅊ, ㅌ), ㅂ(ㅍ, ㄼ, ㄿ, ㅄ)' 뒤에 연결되는 'ㄱ, ㄷ, ㅂ, ㅅ, ㅈ'은 된소리로 발음한다.
-> 국밥[국빱], 깎다[깍따], 넑받이[넉빠지], 삯돈[삭똔]
-> 닭장[닥짱], 칡범[칙뻠], 뻗대다[뻗때다], 옷고름[옫꼬름]
-> 있던[읻떤], 꽂고[꼳꼬], 꽃다발[꼳따발], 낯설다[낟썰다]
-> 밭갈이[받까리], 솥전[솓쩐], 곱돌[곱똘], 덮개[덥깨]
-> 옆집[엽찝], 넓죽하다[넙쭈카다], 읊조리다[읍쪼리다], 값지다[갑찌다] 
학꾜에 갇따 와서, 엄마가 해 주신 밥을 먹얻따. -> 학꾜에 갇따 와서, 엄마가 해 주신 바블 머걷따.
 제13항 홑받침이나 쌍받침이 모음으로 시작된 조사나 어미, 접미사와 결합되는 경우에는, 제 음가대로 뒤 음절 첫소리로 옮겨 발음한다.
-> 깎아[까까], 옷이[오시], 있어[이써], 낮이[나지]
-> 꽂아[꼬자], 꽃을[꼬츨], 쫓아[쪼차], 밭에[바테]
-> 앞으로[아프로], 덮이다[더피다] 
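
The difference between the two to_syl results above can be checked programmatically. The sketch below uses only Python's standard unicodedata module and assumes, without confirmation from this README, that the to_syl=False output consists of Unicode conjoining jamo, which is what NFD normalization of Hangul syllables produces. It does not call g2pK itself.

```python
import unicodedata

# Precomposed Hangul syllables (normalized to NFC to be safe).
syllables = unicodedata.normalize("NFC", "말간는데")
# The same text decomposed into conjoining jamo.
jamo = unicodedata.normalize("NFD", syllables)

# The two strings usually render identically but differ in length.
print(len(syllables), len(jamo))

# Code points of the jamo that make up the first syllable.
print([f"U+{ord(ch):04X}" for ch in jamo[:3]])

# Recomposing the jamo gives back the original syllables.
assert unicodedata.normalize("NFC", jamo) == syllables
```

Comparing lengths or code points in this way is a simple check of which form a string is in when the rendered text looks identical.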

References

If you use our software for research, please cite:

@misc{park2019g2pk,
  author = {Park, Kyubyong},
  title = {g2pK},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2pk}}
}

g2pk's People

Contributors

huffon, jireh-father, kyubyong, rishubil

g2pk's Issues

assertion Error

File "/home/ubuntu/.local/lib/python3.6/site-packages/mecab/mecab.py", line 62, in
for node in lattice
File "/home/ubuntu/.local/lib/python3.6/site-packages/mecab/mecab.py", line 33, in _extract_feature
assert len(values) == 8
AssertionError

I don't think g2pk is recognizing the mecab-ko-dic I have installed, since feature extraction won't yield 8 values if it isn't Korean.
How do I specify the dictionary path so that g2pk correctly identifies Korean?

g2p result report

Hi, I am applying g2pk to my word list and found some strange results. I hope they are helpful for improving the tool.

(1) Cases where "1)" appears in the result:

닭칼국수        다1)알국쑤
밝칵    바1)악
닭칼    다1)알
진흙쿠키는      진흐1)우키는
닭칼국수랑      다1)알국쑤랑
읽킬수  이1)일수
밝켜낸  바1)여낸
밝키시게        바1)이시게
불닭케티는      불다1)에티는
진흙쿠키        진흐1)우키
닭칼국수는      다1)알국쑤는
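
One way to find such cases in a larger word list is to flag outputs that contain anything other than Hangul syllables even though the input was pure Hangul. The following is a minimal sketch: the function name flag_non_hangul is hypothetical, and it works on (word, output) pairs that have already been produced, so it does not depend on g2pk.

```python
import re

# Precomposed Hangul syllables (U+AC00 to U+D7A3) only.
HANGUL_ONLY = re.compile(r"[\uac00-\ud7a3]+")

def flag_non_hangul(pairs):
    """Return the (word, output) pairs where a pure-Hangul word
    produced an output containing non-Hangul characters."""
    return [
        (word, output)
        for word, output in pairs
        if HANGUL_ONLY.fullmatch(word) and not HANGUL_ONLY.fullmatch(output)
    ]

# Two pairs taken from this report: the first contains "1)", the second does not.
pairs = [("닭칼국수", "다1)알국쑤"), ("값어치", "갑써치")]
print(flag_non_hangul(pairs))
```

This check only catches stray non-Hangul characters. It says nothing about whether a pure-Hangul output is the correct pronunciation.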

(2) It also returns incorrect results for some types of words, such as 값있는 and 값어치:

word        g2pk        reference
값있는        갑씬는        가빋따
값어치         갑써치        값어치

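Request to remove the KoNLPy dependency and release a new version

Hello! I would like to request that the KoNLPy dependency be removed and a new version released, either by merging pull request #11 or in some other form.

Running pip install g2pK installs konlpy, but konlpy no longer appears to be used in the g2pK code. (It was replaced in #5.)

If there is any way I can help with removing the dependency and releasing a new version, please let me know and I will contribute right away!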

Segmentation fault

Hi! I have successfully installed this library using pip3 on my machine, which runs Ubuntu. However, whenever I try to use it, I get a segmentation fault (core dumped). What could be causing this, and how might I solve it?

Is rule 15 correctly implemented?

Dear contributors,
Thank you for your great work!

I have been trying to improve TTS quality while keeping the amount of data unchanged.

I thought using the g2pk package would improve the model by significantly reducing the number of distinct tokens fed into it. Rule 8, for example, reduces 21 (Jongsung) tokens to 7 (pronounceable Jongsung) tokens.

I combined GlowTTS, g2pk, and Multi-band MelGAN, trained on the KSS dataset, and obtained the following result.

G2PK Comparison Demo

It seems that g2pk grapheme tokens are much better than just using Jamo tokens!

Yet I found that the g2pk conversion result is slightly different from how Koreans usually pronounce the text.

Since I am no expert in the Korean language, I referred to 한국어 어문 규범 (the Korean language norms) and 부산대학교 표준발음 변환기 (the Pusan National University standard pronunciation converter).

For a sample sentence from the KSS:

Source               Result
Original sentence    저는 귀가 어두운데 다른 사람의 얘기를 아주 잘 들어 준다는 말을 많이 들어왔어요.
G2PK                 저는 귀가 어두운데 다른 사라믜 얘기르 라주 잘 드러 준다는 마를 마니 드러와써요.
부산대학교           저는 귀가 어두운데 다른 사라메 얘기를 아주 잘 드러 준다는 마를 마니 드러와써요

I suspect that "르 라" is the problem, since Koreans do not normally speak like that.

I found the following function in the source file regular.py.


def link3(inp, descriptive=False, verbose=False):
    rule = rule_id2text["15"]
    out = inp

    pairs = [ ("ᆨ ᄋ", " ᄀ"),
                  ...
              ("ᆹ ᄋ", "ᆸ ᄊ") ]

    for str1, str2 in pairs:
        out = out.replace(str1, str2)

    gloss(verbose, out, inp, rule)
    return out

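From 한국어 어문 규범:
제15항 받침 뒤에 모음 ‘ㅏ, ㅓ, ㅗ, ㅜ, ㅟ’ 들로 시작되는 실질 형태소가 연결되는 경우에는, 대표음으로 바꾸어서 뒤 음절 첫소리로 옮겨 발음한다.
(Rule 15: When a final consonant is followed by a content morpheme (실질 형태소) that begins with one of the vowels ‘ㅏ, ㅓ, ㅗ, ㅜ, ㅟ’, it is changed to its representative sound and pronounced as the initial sound of the following syllable.)

It seems that the function does not take 실질 형태소 or the vowels ‘ㅏ, ㅓ, ㅗ, ㅜ, ㅟ’ into account.

Is this handled in another part of the source?

If not, I would like to implement it myself. Please let me know if you have already improved this part.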
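
The vowel condition in rule 15 could be sketched as follows. This is not the g2pk implementation: link_before_rule15_vowels is a hypothetical helper, it handles a single (coda, replacement) pair in the conjoining-jamo format that link3 appears to use, and it does not check whether the following morpheme is a 실질 형태소, which would require a morphological analyzer.

```python
import re
import unicodedata

NULL_ONSET = "\u110B"                             # syllable-initial placeholder ㅇ
RULE15_VOWELS = "\u1161\u1165\u1169\u116E\u1171"  # ㅏ, ㅓ, ㅗ, ㅜ, ㅟ as medial jamo

def link_before_rule15_vowels(text, coda, replacement):
    """Replace 'coda + space + placeholder ㅇ' with `replacement`, but only
    when the next vowel is one of the five vowels listed in rule 15."""
    pattern = re.escape(coda) + " " + NULL_ONSET + "(?=[" + RULE15_VOWELS + "])"
    return re.sub(pattern, replacement, text)

def demo(sentence):
    # Decompose into conjoining jamo, apply one pair, then recompose.
    jamo = unicodedata.normalize("NFD", sentence)
    # Final ㄱ (U+11A8) moves to the next syllable as initial ㄱ (U+1100).
    out = link_before_rule15_vowels(jamo, "\u11A8", " \u1100")
    return unicodedata.normalize("NFC", out)

print(demo("먹 아"))  # followed by ㅏ: the final consonant is carried over
print(demo("먹 이"))  # followed by ㅣ: left unchanged
```

The inputs here are artificial and only exercise the string logic. Whether a given word boundary should undergo linking at all is a separate question.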


The real issue, however, is that the g2pk conversion result above is actually the correct answer according to 한국어 어문 규범!

"아주" starts with "ㅏ" and is a 실질 형태소 (content morpheme), and the 대표음 (representative sound) of the "ㄹ" in "를" is "ㄹ".

So, "를 아주" should be pronounced "르 라주" according to 한국어 어문 규범.

I have been thinking about this issue for several weeks, and I have concluded that Korean speakers tend to insert a pause, like a comma, at the space " " between words when they think it is needed. "얘기를 아주" becomes "얘기를, 아주" to keep the pronunciation of "아" as "아" and distinguish it from "라". Yet I have not found a good algorithm for applying rule 15 selectively in a way that matches my intuition.

As a quick fix, I simply disabled link3 and labeled the result "G2PK no 15" on the demo page.

I have already achieved satisfactory experimental results, and it seems fine to extend my research to phoneme and grapheme alignment.

But, as I mentioned earlier, I am not an expert in Korean or in linguistics.

So, I would appreciate an opinion from a real linguist on how to properly improve my TTS results and G2P conversion.

If you have any opinion regarding my questions, please share it.

Thanks.

Bug in English processing [english.py]

eng_words = set(re.findall("[A-Za-z']+", string))

Because of the quote character ' inside the regular expression, English is not processed correctly.
With the code above used as is:
Input : 'the shawshank redemption'이다. 언뜻 생각하면 'escape'를 썼을 법한데 'redemption'을 썼다. redemption의 사전적 의미는 구원, 속죄, 회복이다.
output : 'the 쇼섄크 리뎀프션'이다. 언뜯 쌩가카면 'escape'를 써쓸 뻐판데 '리뎀프션'을 썯따. 리뎀프셔늬 사전저 긔미는 구원, 속쬐, 회보기다.

It looks like a typo slipped in when closing the square bracket.

So I suggest changing it to eng_words = set(re.findall("[A-Za-z]+", string)).

Output after the fix : '더 쇼섄크 리뎀프션'이다. 언뜯 쌩가카면 '이스케이프'를 써쓸 뻐판데 '리뎀프션'을 썯따. 리뎀프셔늬 사전저 긔미는 구원, 속쬐, 회보기다.
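
The effect of the quote character can be reproduced with the re module alone. The snippet below is a small sketch that runs both patterns on part of the input above. It shows only which tokens are extracted, not how g2pK later looks them up.

```python
import re

text = "언뜻 생각하면 'escape'를 썼을 법한데 'redemption'을 썼다."

# Original pattern: the apostrophe is part of the character class,
# so the surrounding quotes are captured together with each word.
with_quote = set(re.findall("[A-Za-z']+", text))

# Proposed pattern: ASCII letters only.
without_quote = set(re.findall("[A-Za-z]+", text))

print(sorted(with_quote))     # ["'escape'", "'redemption'"]
print(sorted(without_quote))  # ['escape', 'redemption']
```

Note that the letters-only pattern would also split contractions such as "don't" into two tokens, which may or may not matter for the inputs at hand.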
