Comments (3)
Yes, `z` could be interpreted as `ADV` here. By the way, thank you for introducing the Universal Dependencies project, which is very interesting and an important missing piece of Jiayan. I will look into it and hopefully make Jiayan support UD parsing in the future.
from jiayan.
Hi, yes, you are right: `PART` is the right match for `u` in this case. Let me answer your questions one by one:
- Jiayan uses the same POS tag set as LTP, a popular NLP toolkit for modern Chinese, whose POS tag categories are listed here. This tag set was defined in an important modern Chinese information processing project back in 2003 (let's call it the 863 tag set). I tried to find the official page for you, but unfortunately it is obsolete now, so I haven't found any detailed documentation for the tag set so far.
- From my understanding, `auxiliary`/`u` in the 863 POS tag set is not an `auxiliary verb` in the English sense, and that might be what confused you. It is more a functional `particle` than a `VERB`. So you are right, `PART` in UPOS is suitable for `u` in Jiayan. In modern Chinese, the words for `particle (助词)` and `auxiliary verb (助动词)` differ by only one character, so I guess the tag set maker confused the two words at first.
- As for `PRON`: 之 goes to `u` instead of `r` in Jiayan, which is a propagation issue in the model. Since I didn't find any annotated Classical Chinese data when implementing this feature, it became the most difficult task in this project. The way I resolved it was to tokenize the training data with the CharHMMTokenizer from Jiayan and use LTP, mentioned above, to POS-tag the tokenized data. The model is therefore in fact trained in a modern Chinese fashion, which is acceptable for most words but not for some, like 之. 之 is polysemous in Classical Chinese, where it can be either `PART` or `PRON`; in modern Chinese, however, it can only be `PART`. As a result, the model tags every 之 as `u`, even where it should be a `PRON`.
- The same issue applies to 地. In contrast to 之, 地 is polysemous in modern Chinese, where it can be either `NOUN` or `PART`, but it can only be `NOUN` in Classical Chinese. With modern-Chinese-style tagging, it therefore goes not only to `NOUN` but also to `PART`, which is `u` in Jiayan. In this case, you could convert all `u`-tagged 地 to `NOUN` with post-processing.
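The post-processing step suggested for 地 could look roughly like the sketch below. It assumes the tagger output is available as two parallel lists of words and tags, and that `n` is the tag that corresponds to `NOUN`; the function name `fix_di_tags` is hypothetical and not part of Jiayan.

```python
def fix_di_tags(words, tags):
    """Retag 地 from 'u' (particle) to 'n' (noun).

    Hypothetical helper, not part of Jiayan. It assumes `words` and
    `tags` are parallel lists and leaves every other token untouched.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return [
        "n" if word == "地" and tag == "u" else tag
        for word, tag in zip(words, tags)
    ]


# Illustrative input only; these tags are made up for the example.
words = ["天", "地", "之", "間"]
tags = ["n", "u", "u", "nd"]
fixed = fix_di_tags(words, tags)
```

The same trick does not work for 之, because it can legitimately be either `PART` or `PRON` in Classical Chinese, so a blanket rewrite would only trade one error for another.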
In conclusion, the modern Chinese tagging fashion of the Jiayan POS model could mistag words that are polysemous in either Classical Chinese or modern Chinese. I will look deeper into the POS tagging feature to see what can be improved in the future.
Sorry for the long answer; I hope it helps. Thank you very much!
Thank you @jiaeyan for the information about LTP. I've just written a tentative table to convert the POS tags of LTP into the UPOS tags of Universal Dependencies:
| LTP | UPOS |
| --- | --- |
| a | ADJ |
| b | NOUN |
| c | CCONJ |
| d | ADV |
| e | INTJ |
| g | NOUN |
| h | PART |
| i | NOUN |
| j | PROPN |
| k | PART |
| m | NUM |
| n | NOUN |
| nd | NOUN |
| nh | PROPN |
| ni | PROPN |
| nl | NOUN |
| ns | PROPN |
| nt | NOUN |
| nz | PROPN |
| o | INTJ |
| p | ADP |
| q | NOUN |
| r | PRON |
| u | PART |
| v | VERB |
| wp | PUNCT |
| ws | X |
| x | SYM |
In addition, `z` (descriptive words) should be converted into... ah well, into `ADV`?
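For illustration, the tentative table above can be written as a plain lookup. This is only a sketch: the names `LTP_TO_UPOS` and `to_upos` are hypothetical, the `z` entry follows the tentative `ADV` suggestion, and falling back to `X` for unknown tags is an assumption rather than anything decided in this thread.

```python
# Tentative LTP (863 tag set) -> UPOS mapping, copied from the table above.
LTP_TO_UPOS = {
    "a": "ADJ", "b": "NOUN", "c": "CCONJ", "d": "ADV", "e": "INTJ",
    "g": "NOUN", "h": "PART", "i": "NOUN", "j": "PROPN", "k": "PART",
    "m": "NUM", "n": "NOUN", "nd": "NOUN", "nh": "PROPN", "ni": "PROPN",
    "nl": "NOUN", "ns": "PROPN", "nt": "NOUN", "nz": "PROPN", "o": "INTJ",
    "p": "ADP", "q": "NOUN", "r": "PRON", "u": "PART", "v": "VERB",
    "wp": "PUNCT", "ws": "X", "x": "SYM",
    "z": "ADV",  # tentative, see the question above
}


def to_upos(tags, default="X"):
    """Convert a list of LTP-style tags to UPOS.

    Hypothetical helper; unknown tags fall back to `default` (an assumption).
    """
    return [LTP_TO_UPOS.get(tag, default) for tag in tags]


converted = to_upos(["n", "u", "r", "z"])
```

Note that a tag-level mapping like this cannot repair the 之 and 地 cases discussed earlier, since those are mistagged before the conversion happens.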