jiaeyan / jiayan

Jiayan (甲言), the first NLP toolkit designed for Classical Chinese, supporting lexicon construction, tokenization, POS tagging, sentence segmentation, and punctuation.

License: MIT License

Language: Python (100.00%)
Topics: classical-chinese, nlp, ancient-chinese

jiayan's Introduction

甲言 Jiayan


中文
English

Introduction

Jiayan (甲言), whose name comes from 「甲骨文言」 (the Classical Chinese engraved on oracle bones), is an NLP toolkit dedicated to the processing of Classical Chinese.
Mainstream Chinese NLP tools are built mainly on modern Chinese corpora and handle Classical Chinese poorly (see Tokenizing below). This project was started to support Classical Chinese information processing and to help scholars and enthusiasts who wish to mine the treasures of ancient Chinese culture to better analyze and exploit Classical Chinese materials, creating new culture out of cultural heritage.
The current version supports five features: lexicon construction, automatic tokenization, POS tagging, sentence segmentation, and punctuation; more features are under development.

Features

  • Lexicon Construction
  • Tokenizing
    • Unsupervised, dictionary-free tokenization of Classical Chinese with an N-gram hidden Markov model.
    • Tokenization with the Classical Chinese dictionary produced by lexicon construction, based on a directed acyclic word graph, maximum-probability sentence paths, and dynamic programming.
  • POS Tagging
  • Sentence Segmentation
    • Character-level CRF sequence tagging, with pointwise mutual information and t-test values as features, to segment Classical Chinese passages into sentences automatically.
  • Punctuation
    • Character-level sequence tagging with layered CRFs, punctuating Classical Chinese passages on top of the sentence segmentation results.
  • Classical-to-modern Chinese translation (under development)
  • Note: because of the corpora used, traditional Chinese is not supported for now. To process traditional text, first convert the input to simplified Chinese with OpenCC, then convert the results back to the appropriate traditional variant (e.g., for Hong Kong, Macao, or Taiwan); a sketch follows this list.
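
A minimal round-trip sketch, assuming the opencc Python package is installed (the 't2s'/'s2t' configs are standard OpenCC presets; this example is illustrative and not part of examples.py):

    from opencc import OpenCC
    from jiayan import load_lm, CharHMMTokenizer

    to_simplified = OpenCC('t2s').convert    # traditional -> simplified
    to_traditional = OpenCC('s2t').convert   # simplified -> traditional

    lm = load_lm('jiayan.klm')
    tokenizer = CharHMMTokenizer(lm)

    trad_text = '是故內聖外王之道,暗而不明,鬱而不發。'
    simp_text = to_simplified(trad_text)             # feed simplified text to Jiayan
    tokens = list(tokenizer.tokenize(simp_text))
    print([to_traditional(tok) for tok in tokens])   # map the results back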

Installation

$ pip install jiayan 
$ pip install https://github.com/kpu/kenlm/archive/master.zip
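
If both steps succeed, the two packages should import cleanly; a quick sanity check (illustrative, not part of the README):

    # kenlm is compiled from source during the second install step above.
    import kenlm
    import jiayan
    print('kenlm and jiayan imported successfully')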

Usage

The usage examples for all modules below come from examples.py.

  1. Download the models and unzip them: Baidu Netdisk (extraction code: p0sc)

    • jiayan.klm: the language model, used mainly for tokenization and for feature extraction in the sentence segmentation and punctuation tasks;
    • pos_model: the CRF POS tagging model;
    • cut_model: the CRF sentence segmentation model;
    • punc_model: the CRF punctuation model;
    • 庄子.txt: the full text of the Zhuangzi, used to test lexicon construction.
  2. Lexicon Construction

    from jiayan import PMIEntropyLexiconConstructor
    
    constructor = PMIEntropyLexiconConstructor()
    lexicon = constructor.construct_lexicon('庄子.txt')
    constructor.save(lexicon, '庄子词库.csv')
    

    Result:

    Word,Frequency,PMI,R_Entropy,L_Entropy
    之,2999,80,7.944909328101839,8.279435615456894
    而,2089,80,7.354575005231323,8.615211168836439
    不,1941,80,7.244331150611089,6.362131306822925
    ...
    天下,280,195.23602384978196,5.158574399464853,5.24731990592901
    圣人,111,150.0620531154239,4.622606551534004,4.6853474419338585
    万物,94,377.59805590304126,4.5959107835319895,4.538837960294887
    天地,92,186.73504238078462,3.1492586603863617,4.894533538722486
    孔子,80,176.2550051738876,4.284638190120882,2.4056390622295662
    庄子,76,169.26227942514097,2.328252899085616,2.1920058354921066
    仁义,58,882.3468468468468,3.501609497059026,4.96900162987599
    老聃,45,2281.2228260869565,2.384853500510039,2.4331958387289765
    ...
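
    A hypothetical post-processing sketch: keep multi-character entries whose PMI and boundary entropies are both high (the column names match the CSV header shown above; the thresholds are illustrative, not tuned):

      import csv

      with open('庄子词库.csv', encoding='utf-8') as f:
          rows = list(csv.DictReader(f))

      words = [r['Word'] for r in rows
               if len(r['Word']) > 1
               and float(r['PMI']) > 150
               and min(float(r['R_Entropy']), float(r['L_Entropy'])) > 2.0]
      print(words[:10])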
    
  3. Tokenizing

    1. Character-level hidden Markov model tokenization; its results agree well with native intuition and it is the recommended method. Requires the language model jiayan.klm.

      from jiayan import load_lm
      from jiayan import CharHMMTokenizer
      
      text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'
      
      lm = load_lm('jiayan.klm')
      tokenizer = CharHMMTokenizer(lm)
      print(list(tokenizer.tokenize(text)))
      

      Result:
      ['是', '故', '内圣外王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']

      Since there is no public tokenization dataset for Classical Chinese, a quantitative evaluation is not possible; still, we can get an intuitive sense of this project's advantage by comparing how different NLP tools handle the same sentence:

      Compare the tokenization result of the LTP (3.4.0) model:
      ['是', '故内', '圣外王', '之', '道', ',', '暗而不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉以自为方', '。']

      And the tokenization result of HanLP:
      ['是故', '内', '圣', '外', '王之道', ',', '暗', '而', '不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各为其所欲焉', '以', '自为', '方', '。']

      Jiayan clearly tokenizes Classical Chinese much better than general-purpose Chinese NLP tools.

      *Update: thanks to HanLP's author hankcs for pointing out that, since early 2021, HanLP has shipped a deep-learning-driven 2.x. Because it uses language models pretrained on large-scale corpora, which cover nearly all of the Classical and modern Chinese text on the internet, its performance on Classical Chinese has improved qualitatively; not only tokenization but also POS tagging and semantic analysis show some zero-shot learning capability. For concrete tokenization results, see the corresponding issue below.
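
      A small follow-on sketch (not from examples.py): the tokenizer can be streamed over a whole file, e.g. to count token frequencies:

        from collections import Counter
        from jiayan import load_lm, CharHMMTokenizer

        lm = load_lm('jiayan.klm')
        tokenizer = CharHMMTokenizer(lm)

        counts = Counter()
        with open('庄子.txt', encoding='utf-8') as f:
            for line in f:
                if line.strip():
                    counts.update(tokenizer.tokenize(line.strip()))
        print(counts.most_common(10))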

    2. Word-level maximum-probability-path tokenization; it mostly falls back to single characters, so its granularity is coarse.

      from jiayan import WordNgramTokenizer
      
      text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'
      tokenizer = WordNgramTokenizer()
      print(list(tokenizer.tokenize(text)))
      

      Result:
      ['是', '故', '内', '圣', '外', '王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']

  4. POS Tagging

    from jiayan import CRFPOSTagger
    
    words = ['天下', '大乱', ',', '贤圣', '不', '明', ',', '道德', '不', '一', ',', '天下', '多', '得', '一', '察', '焉', '以', '自', '好', '。']
    
    postagger = CRFPOSTagger()
    postagger.load('pos_model')
    print(postagger.postag(words))
    

    Result:
    ['n', 'a', 'wp', 'n', 'd', 'a', 'wp', 'n', 'd', 'm', 'wp', 'n', 'a', 'u', 'm', 'v', 'r', 'p', 'r', 'a', 'wp']
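
    A pipeline sketch (illustrative, not from examples.py) that chains the HMM tokenizer with the CRF POS tagger, using the models introduced above:

      from jiayan import load_lm, CharHMMTokenizer, CRFPOSTagger

      lm = load_lm('jiayan.klm')
      tokenizer = CharHMMTokenizer(lm)
      postagger = CRFPOSTagger()
      postagger.load('pos_model')

      # Tokenize raw text, then tag the resulting words.
      words = list(tokenizer.tokenize('天下大乱,贤圣不明,道德不一。'))
      print(list(zip(words, postagger.postag(words))))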

  5. Sentence Segmentation

    from jiayan import load_lm
    from jiayan import CRFSentencizer
    
    text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'
    
    lm = load_lm('jiayan.klm')
    sentencizer = CRFSentencizer(lm)
    sentencizer.load('cut_model')
    print(sentencizer.sentencize(text))
    

    Result:
    ['天下大乱', '贤圣不明', '道德不一', '天下多得一察焉以自好', '譬如耳目', '皆有所明', '不能相通', '犹百家众技也', '皆有所长', '时有所用', '虽然', '不该不遍', '一之士也', '判天地之美', '析万物之理', '察古人之全', '寡能备于天地之美', '称神之容', '是故内圣外王之道', '暗而不明', '郁而不发', '天下之人各为其所欲焉以自为方', '悲夫', '百家往而不反', '必不合矣', '后世之学者', '不幸不见天地之纯', '古之大体', '道术将为天下裂']
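
    A follow-on sketch (illustrative, not from examples.py): feed each segmented clause to the HMM tokenizer, turning an unpunctuated passage into tokenized clauses:

      from jiayan import load_lm, CharHMMTokenizer, CRFSentencizer

      text = '天下大乱贤圣不明道德不一'
      lm = load_lm('jiayan.klm')
      sentencizer = CRFSentencizer(lm)
      sentencizer.load('cut_model')
      tokenizer = CharHMMTokenizer(lm)

      for clause in sentencizer.sentencize(text):
          print(list(tokenizer.tokenize(clause)))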

  6. Punctuation

    from jiayan import load_lm
    from jiayan import CRFPunctuator
    
    text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'
    
    lm = load_lm('jiayan.klm')
    punctuator = CRFPunctuator(lm, 'cut_model')
    punctuator.load('punc_model')
    print(punctuator.punctuate(text))
    

    Result:
    天下大乱,贤圣不明,道德不一,天下多得一察焉以自好,譬如耳目,皆有所明,不能相通,犹百家众技也,皆有所长,时有所用,虽然,不该不遍,一之士也,判天地之美,析万物之理,察古人之全,寡能备于天地之美,称神之容,是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方,悲夫!百家往而不反,必不合矣,后世之学者,不幸不见天地之纯,古之大体,道术将为天下裂。

Versions

  • v0.0.21
    • Split the installation into two steps to ensure the latest version of kenlm is installed.
  • v0.0.2
    • Added the POS tagging feature.
  • v0.0.1
    • Initial release: lexicon construction, automatic tokenization, sentence segmentation, and punctuation.

Introduction

Jiayan, whose name refers to the Classical Chinese engraved on oracle bones, is a professional Python NLP toolkit for Classical Chinese.
Prevailing Chinese NLP tools are mainly trained on modern Chinese data, which leads to poor performance on Classical Chinese (see Tokenizing). The purpose of this project is to assist Classical Chinese information processing.
The current version supports lexicon construction, tokenization, POS tagging, sentence segmentation, and automatic punctuation; more features are in development.

Features

  • Lexicon Construction
    • With an unsupervised approach, construct a lexicon using a trie, PMI (pointwise mutual information), and the neighboring entropy of left and right characters; see the sketch after this list.
  • Tokenizing
    • An unsupervised, dictionary-free approach that tokenizes a Classical Chinese sentence with an N-gram language model and an HMM (hidden Markov model).
    • Using the dictionary produced by lexicon construction, tokenize a Classical Chinese sentence with a directed acyclic word graph, maximum-probability paths, and dynamic programming.
  • POS Tagging
    • Word-level sequence tagging with a CRF (conditional random field). See the POS tag categories here.
  • Sentence Segmentation
    • Character-level sequence tagging with a CRF, using PMI and t-test values as features.
  • Punctuation
    • Character-level sequence tagging with layered CRFs, which punctuates Classical Chinese texts based on the sentence segmentation results.
  • Note: due to the data we used, traditional Chinese is not supported for now. If you need to process traditional text, please use OpenCC to convert the input to simplified Chinese first; the results can then be converted back (see the OpenCC sketch in the section above).
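
A toy illustration (not Jiayan's actual implementation) of the two statistics behind lexicon construction: the PMI of a bigram candidate, and the entropy of its neighboring characters, computed over a tiny corpus string:

    import math
    from collections import Counter

    corpus = '天下大乱贤圣不明天下多得一察焉以自好天下之人各为其所欲'
    candidate = '天下'

    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i+2] for i in range(len(corpus) - 1))
    n = len(corpus)

    # PMI: how much more often the two characters co-occur than chance predicts.
    p_xy = bigrams[candidate] / (n - 1)
    p_x = unigrams[candidate[0]] / n
    p_y = unigrams[candidate[1]] / n
    pmi = math.log2(p_xy / (p_x * p_y))

    # Right-neighbor entropy: diverse right contexts suggest a true word boundary.
    rights = Counter(corpus[i+2] for i in range(len(corpus) - 2)
                     if corpus[i:i+2] == candidate)
    total = sum(rights.values())
    r_entropy = -sum(c/total * math.log2(c/total) for c in rights.values())

    print(f'PMI({candidate}) = {pmi:.2f}, right entropy = {r_entropy:.2f}')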

Installation

$ pip install jiayan 
$ pip install https://github.com/kpu/kenlm/archive/master.zip

Usages

The usage examples below are all from examples.py.

  1. Download the models and unzip them: Google Drive

    • jiayan.klm: the language model, used for tokenization and for feature extraction in sentence segmentation and punctuation;
    • pos_model: the CRF model for POS tagging;
    • cut_model: the CRF model for sentence segmentation;
    • punc_model: the CRF model for punctuation;
    • 庄子.txt: the full text of the Zhuangzi, used for testing lexicon construction.
  2. Lexicon Construction

    from jiayan import PMIEntropyLexiconConstructor
    
    constructor = PMIEntropyLexiconConstructor()
    lexicon = constructor.construct_lexicon('庄子.txt')
    constructor.save(lexicon, 'Zhuangzi_Lexicon.csv')
    

    Result:

    Word,Frequency,PMI,R_Entropy,L_Entropy
    之,2999,80,7.944909328101839,8.279435615456894
    而,2089,80,7.354575005231323,8.615211168836439
    不,1941,80,7.244331150611089,6.362131306822925
    ...
    天下,280,195.23602384978196,5.158574399464853,5.24731990592901
    圣人,111,150.0620531154239,4.622606551534004,4.6853474419338585
    万物,94,377.59805590304126,4.5959107835319895,4.538837960294887
    天地,92,186.73504238078462,3.1492586603863617,4.894533538722486
    孔子,80,176.2550051738876,4.284638190120882,2.4056390622295662
    庄子,76,169.26227942514097,2.328252899085616,2.1920058354921066
    仁义,58,882.3468468468468,3.501609497059026,4.96900162987599
    老聃,45,2281.2228260869565,2.384853500510039,2.4331958387289765
    ...
    
  3. Tokenizing

    1. The character-based HMM tokenizer; its results agree well with native intuition and it is the recommended method. Requires the language model jiayan.klm.

      from jiayan import load_lm
      from jiayan import CharHMMTokenizer
      
      text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'
      
      lm = load_lm('jiayan.klm')
      tokenizer = CharHMMTokenizer(lm)
      print(list(tokenizer.tokenize(text)))
      

      Result:
      ['是', '故', '内圣外王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']

      Since there is no public tokenization dataset for Classical Chinese, it is hard to evaluate performance directly; however, we can compare the results with other popular modern Chinese NLP tools on the same sentence to get an intuitive sense of the difference:

      Compare the tokenization result of LTP (3.4.0):
      ['是', '故内', '圣外王', '之', '道', ',', '暗而不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉以自为方', '。']

      Also compare the tokenization result of HanLP:
      ['是故', '内', '圣', '外', '王之道', ',', '暗', '而', '不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各为其所欲焉', '以', '自为', '方', '。']

      It's apparent that Jiayan tokenizes Classical Chinese much better than general-purpose Chinese NLP tools. (Update: since early 2021, HanLP 2.x's deep-learning models have improved substantially on Classical Chinese; see the update note above and the corresponding issue below.)

    2. Word-level maximum-probability-path tokenization; it mostly falls back to single characters, so its granularity is coarse.

      from jiayan import WordNgramTokenizer
      
      text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'
      tokenizer = WordNgramTokenizer()
      print(list(tokenizer.tokenize(text)))
      

      Result:
      ['是', '故', '内', '圣', '外', '王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']

  4. POS Tagging

    from jiayan import CRFPOSTagger
    
    words = ['天下', '大乱', ',', '贤圣', '不', '明', ',', '道德', '不', '一', ',', '天下', '多', '得', '一', '察', '焉', '以', '自', '好', '。']
    
    postagger = CRFPOSTagger()
    postagger.load('pos_model')
    print(postagger.postag(words))
    

    Result:
    ['n', 'a', 'wp', 'n', 'd', 'a', 'wp', 'n', 'd', 'm', 'wp', 'n', 'a', 'u', 'm', 'v', 'r', 'p', 'r', 'a', 'wp']

  5. Sentence Segmentation

    from jiayan import load_lm
    from jiayan import CRFSentencizer
    
    text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'
    
    lm = load_lm('jiayan.klm')
    sentencizer = CRFSentencizer(lm)
    sentencizer.load('cut_model')
    print(sentencizer.sentencize(text))
    

    Result:
    ['天下大乱', '贤圣不明', '道德不一', '天下多得一察焉以自好', '譬如耳目', '皆有所明', '不能相通', '犹百家众技也', '皆有所长', '时有所用', '虽然', '不该不遍', '一之士也', '判天地之美', '析万物之理', '察古人之全', '寡能备于天地之美', '称神之容', '是故内圣外王之道', '暗而不明', '郁而不发', '天下之人各为其所欲焉以自为方', '悲夫', '百家往而不反', '必不合矣', '后世之学者', '不幸不见天地之纯', '古之大体', '道术将为天下裂']

  6. Punctuation

    from jiayan import load_lm
    from jiayan import CRFPunctuator
    
    text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'
    
    lm = load_lm('jiayan.klm')
    punctuator = CRFPunctuator(lm, 'cut_model')
    punctuator.load('punc_model')
    print(punctuator.punctuate(text))
    

    Result:
    天下大乱,贤圣不明,道德不一,天下多得一察焉以自好,譬如耳目,皆有所明,不能相通,犹百家众技也,皆有所长,时有所用,虽然,不该不遍,一之士也,判天地之美,析万物之理,察古人之全,寡能备于天地之美,称神之容,是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方,悲夫!百家往而不反,必不合矣,后世之学者,不幸不见天地之纯,古之大体,道术将为天下裂。

Versions

  • v0.0.21
    • Split the installation into two steps to ensure the latest version of kenlm is installed.
  • v0.0.2
    • Added the POS tagging feature.
  • v0.0.1
    • Initial release: lexicon construction, tokenization, sentence segmentation, and automatic punctuation.

jiayan's People

Contributors

jiaeyan · koichiyasuoka


jiayan's Issues

Could you update the HanLP tokenization results? HanLP 2.x's deep-learning models have improved greatly on Classical Chinese

Thank you for your work; Jiayan is in a class of its own for Classical Chinese processing, and thanks for including the comparison with HanLP.

I noticed that the HanLP results in the documentation appear to be from 1.x, which indeed performed poorly. However, since early 2021 HanLP has released a deep-learning-driven 2.x. Because it uses language models pretrained on large-scale corpora, which cover nearly all of the Classical and modern Chinese text on the internet, its performance on Classical Chinese has improved qualitatively. Not only tokenization, but POS tagging and semantic analysis also show some zero-shot learning capability. For example:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api')
HanLP('是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。').pretty_print()

(HanLP's pretty_print output: a combined visualization of the dependency tree, tokens, POS tags, semantic role labels, and constituency parse for the example sentence; the column alignment did not survive plain-text extraction.)

You can try other Classical Chinese sentences online. If convenient, could you update the HanLP tokenization results?

Thanks.

Converting Jiayan's POS into UPOS of Universal Dependencies

I'm now trying to convert Jiayan's POS tags into the UPOS tagset of Universal Dependencies, but I'm unsure what auxiliary means in Jiayan's POS scheme and cannot determine the proper UPOS for Jiayan's u. So I've tried to compare UPOS (as used in the train file of https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto) with Jiayan's CRFPOSTagger:

% git clone https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto.git
% python
>>> from jiayan import CRFPOSTagger
>>> postagger=CRFPOSTagger()
>>> postagger.load("jiayan_models/pos_model")
>>> from opencc import OpenCC
>>> t2s=OpenCC("t2s").convert
>>> train=open("UD_Classical_Chinese-Kyoto/lzh_kyoto-ud-train.conllu","r")
>>> ud=[t.split("\t") for t in train.read().split("\n") if t!="" and not t.startswith("#")]
>>> train.close()
>>> form=[t[1] for t in ud]
>>> upos=[t[3] for t in ud]
>>> simp=[t2s(f) for f in form]
>>> jpos=postagger.postag(simp)
>>> import collections
>>> collections.Counter((f,u,s,j) for f,u,s,j in zip(form,upos,simp,jpos) if j=="u")
Counter({('之', 'SCONJ', '之', 'u'): 841, ('之', 'PRON', '之', 'u'): 561, ('矣', 'PART', '矣', 'u'): 289, ('所', 'PART', '所', 'u'): 173, ('乎', 'PART', '乎', 'u'): 166, ('也', 'PART', '也', 'u'): 101, ('哉', 'PART', '哉', 'u'): 99, ('焉', 'PART', '焉', 'u'): 83, ('乎', 'ADP', '乎', 'u'): 34, ('地', 'NOUN', '地', 'u'): 27, ('得', 'VERB', '得', 'u'): 24, ('之', 'VERB', '之', 'u'): 23, ('焉', 'ADV', '焉', 'u'): 16, ('過', 'VERB', '过', 'u'): 15, ('過', 'NOUN', '过', 'u'): 10, ('焉', 'PRON', '焉', 'u'): 10, ('所', 'NOUN', '所', 'u'): 8, ('得', 'AUX', '得', 'u'): 7, ('兮', 'PART', '兮', 'u'): 6, ('者', 'PART', '者', 'u'): 4, ('等', 'NOUN', '等', 'u'): 3, ('等', 'VERB', '等', 'u'): 2, ('般', 'VERB', '般', 'u'): 2, ('夫', 'PART', '夫', 'u'): 1, ('鄭', 'PROPN', '郑', 'u'): 1, ('其', 'PRON', '其', 'u'): 1, ('否', 'VERB', '否', 'u'): 1, ('之', 'PROPN', '之', 'u'): 1, ('連', 'VERB', '连', 'u'): 1, ('般', 'PROPN', '般', 'u'): 1, ('斯', 'ADV', '斯', 'u'): 1})

After the comparison, PART seems suitable for u, except for 之. But I wonder why the PRON instances of 之 do not map to Jiayan's r (pronoun), and why the NOUN instances of 地 do not map to Jiayan's n (noun). Is there any documentation about Jiayan's POS tags?

Encoding problem

While following the tutorial to extract word frequencies from the Zhuangzi, I got the error:
'charmap' codec can't encode character '\u4e4b'
Adding an encoding='utf8' parameter to the save function of the PMIEntropyLexiconConstructor class fixes it; I'm not sure whether this problem is specific to my setup.

Corpus

Could you share the corpora you processed, i.e., texts that have already been tagged and tokenized?

Question about automatic POS tagging

Hello,
Regarding the implementation details of the POS tagging model in this project (CRF-based POS tagging), could you point to any related papers or blog posts? I'd like to study the design and underlying principles of the algorithm in depth.
Many thanks.

Maximum word length

I see that the maximum word length in your code is set to 4, which seems problematic. For Classical Chinese, a maximum word length of 2 or 3 is more appropriate. A few titles are trisyllabic, such as 秦穆公, but otherwise words are almost all monosyllabic, with disyllabic words making up a certain but small proportion. With the maximum word length set to 4, virtually none of the extracted four-character strings are real words.

About jiayan.klm

Hello, how was the jiayan.klm file produced, and how can I build my own .klm file?

Question about tokenization

Hello, I'd like to use Jiayan's tokenizer as the tokenizer for a language model, but I found the results unsatisfactory on my data. Do you have any suggestions for improving them? Many thanks.

Sentencize seems not to work

My Python code:

# print('\nSentencizing test text with CRF...')
LM_path = os.path.join(os.path.dirname(__file__), '.', 'models', 'jiayan.klm')
cut_model_path = os.path.join(os.path.dirname(__file__), '.', 'models', 'cut_model')
result = crf_sentencize(LM_path, cut_model_path, test_f1)
print(result)
print(''.join(result).decode('utf-8'))

output:

(venv)  $: python examples.py
['\xe5', '\x9c', '\xa3', '\xe4', '\xba', '\xba', '\xe4', '\xb9', '\x8b', '\xe6', '\xb2', '\xbb', '\xe6', '\xb0', '\x91', '\xe4', '\xb9', '\x9f', '\xe5', '\x85', '\x88', '\xe6', '\xb2', '\xbb', '\xe8', '\x80', '\x85', '\xe5', '\xbc', '\xba', '\xe5', '\x85', '\x88', '\xe6', '\x88', '\x98', '\xe8', '\x80', '\x85', '\xe8', '\x83', '\x9c', '\xe5', '\xa4', '\xab', '\xe5', '\x9b', '\xbd', '\xe4', '\xba', '\x8b', '\xe5', '\x8a', '\xa1', '\xe5', '\x85', '\x88', '\xe8', '\x80', '\x8c', '\xe4', '\xb8', '\x80', '\xe6', '\xb0', '\x91', '\xe5', '\xbf', '\x83', '\xe4', '\xb8', '\x93', '\xe4', '\xb8', '\xbe', '\xe5', '\x85', '\xac', '\xe8', '\x80', '\x8c', '\xe7', '\xa7', '\x81', '\xe4', '\xb8', '\x8d', '\xe4', '\xbb', '\x8e', '\xe8', '\xb5', '\x8f', '\xe5', '\x91', '\x8a', '\xe8', '\x80', '\x8c', '\xe5', '\xa5', '\xb8', '\xe4', '\xb8', '\x8d', '\xe7', '\x94', '\x9f', '\xe6', '\x98', '\x8e', '\xe6', '\xb3', '\x95', '\xe8', '\x80', '\x8c', '\xe6', '\xb2', '\xbb', '\xe4', '\xb8', '\x8d', '\xe7', '\x83', '\xa6', '\xe8', '\x83', '\xbd', '\xe7', '\x94', '\xa8', '\xe5', '\x9b', '\x9b', '\xe8', '\x80', '\x85', '\xe5', '\xbc', '\xba', '\xe4', '\xb8', '\x8d', '\xe8', '\x83', '\xbd', '\xe7', '\x94', '\xa8', '\xe5', '\x9b', '\x9b', '\xe8', '\x80', '\x85', '\xe5', '\xbc', '\xb1', '\xe5', '\xa4', '\xab', '\xe5', '\x9b', '\xbd', '\xe4', '\xb9', '\x8b', '\xe6', '\x89', '\x80', '\xe4', '\xbb', '\xa5', '\xe5', '\xbc', '\xba', '\xe8', '\x80', '\x85', '\xe6', '\x94', '\xbf', '\xe4', '\xb9', '\x9f', '\xe4', '\xb8', '\xbb', '\xe4', '\xb9', '\x8b', '\xe6', '\x89', '\x80', '\xe4', '\xbb', '\xa5', '\xe5', '\xb0', '\x8a', '\xe8', '\x80', '\x85', '\xe6', '\x9d', '\x83', '\xe4', '\xb9', '\x9f', '\xe6', '\x95', '\x85', '\xe6', '\x98', '\x8e', '\xe5', '\x90', '\x9b', '\xe6', '\x9c', '\x89', '\xe6', '\x9d', '\x83', '\xe6', '\x9c', '\x89', '\xe6', '\x94', '\xbf', '\xe4', '\xb9', '\xb1', '\xe5', '\x90', '\x9b', '\xe4', '\xba', '\xa6', '\xe6', '\x9c', '\x89', '\xe6', '\x9d', '\x83', '\xe6', '\x9c', '\x89', '\xe6', '\x94', '\xbf', '\xe7', '\xa7', '\xaf', '\xe8', '\x80', '\x8c', '\xe4', '\xb8', '\x8d', '\xe5', '\x90', '\x8c', '\xe5', '\x85', '\xb6', '\xe6', '\x89', '\x80', '\xe4', '\xbb', '\xa5', '\xe7', '\xab', '\x8b', '\xe5', '\xbc', '\x82', '\xe4', '\xb9', '\x9f', '\xe6', '\x95', '\x85', '\xe6', '\x98', '\x8e', '\xe5', '\x90', '\x9b', '\xe6', '\x93', '\x8d', '\xe6', '\x9d', '\x83', '\xe8', '\x80', '\x8c', '\xe4', '\xb8', '\x8a', '\xe9', '\x87', '\x8d', '\xe4', '\xb8', '\x80', '\xe6', '\x94', '\xbf', '\xe8', '\x80', '\x8c', '\xe5', '\x9b', '\xbd', '\xe6', '\xb2', '\xbb', '\xe6', '\x95', '\x85', '\xe6', '\xb3', '\x95', '\xe8', '\x80', '\x85', '\xe7', '\x8e', '\x8b', '\xe4', '\xb9', '\x8b', '\xe6', '\x9c', '\xac', '\xe4', '\xb9', '\x9f', '\xe5', '\x88', '\x91', '\xe8', '\x80', '\x85', '\xe7', '\x88', '\xb1', '\xe4', '\xb9', '\x8b', '\xe8', '\x87', '\xaa', '\xe4', '\xb9', '\x9f']
圣人之治民也先治者强先战者胜夫国事务先而一民心专举公而私不从赏告而奸不生明法而治不烦能用四者强不能用四者弱夫国之所以强者政也主之所以尊者权也故明君有权有政乱君亦有权有政积而不同其所以立异也故明君操权而上重一政而国治故法者王之本也刑者爱之自也

pip installation failure

Hello, does this project no longer support Python 2.7?

pip install failed

Hi, I've just tried to install jiayan with

pip install jiayan

but it failed when compiling kenlm. The pip version of kenlm seems very old and has not caught up with recent modifications made in the GitHub version of kenlm. So, how do I install jiayan safely? Should I install kenlm with

pip install https://github.com/kpu/kenlm/archive/master.zip

before installing jiayan?
