The chinesecorrector from liu-hanwen

#README

Author: Hanwen, LIU - HKUST

A Chinese word correction system with detection and correction functions, based on an n-gram language model and Chinese text segmentation. The detection core focuses on continuous singletons, while the correction core focuses on the shape and pronunciation similarity of characters.

Usage

In [1]: import Checker

PREPROCESSING DONE!

In [2]: fix = Checker.correct_core('我已经等猴多时了。')

In [3]: for word in fix:
   ...:     print(word)
   ...:
我
已经
['等候多时', '得道多助', '勇而多计']
了
。 

In [4]: fix = Checker.correct_core('我已经等猴多时了。')

In [5]: answer = ''

In [6]: for word in fix:
   ...:     if type(word)==list:
   ...:         answer+=word[0]
   ...:     else:
   ...:         answer+=word
   ...:

In [7]: print(answer)
我已经等候多时了。
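
The loop in the session above can be wrapped in a small helper. This is a minimal sketch, assuming `correct_core` returns a sequence in which plain strings are unchanged words and lists hold ranked candidates, as the printed output suggests. The name `join_top_candidates` is hypothetical and not part of Checker.py.

```python
def join_top_candidates(fix):
    """Join correct_core-style output, taking the first candidate of each list."""
    parts = []
    for word in fix:
        if isinstance(word, list):
            # A list holds ranked candidates; the session above takes word[0].
            parts.append(word[0] if word else "")
        else:
            parts.append(word)
    return "".join(parts)


# Output copied from the session above, so this runs without importing Checker.
fix = ['我', '已经', ['等候多时', '得道多助', '勇而多计'], '了', '。']
print(join_top_candidates(fix))
```

With the real module, the input would come from `Checker.correct_core(...)` instead of the hard-coded list.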

Please make sure that the data files referenced in Checker.py have been downloaded. Some of the files are large, so please download them from Google Drive: https://drive.google.com/open?id=1A_rifWNTVLkPeTfKTPN-KaeaqTj09-IG

Files

Checker.py: Detection and correction system module.
CharSimilarity.py: Characters similarity measurement module.
Experimental Results.ipynb: Experimental Results.
sijiao_dict.py: Sijiao codes of characters from https://github.com/contr4l/SimilarCharactor.
similar_char_preprocessing.py: Precomputes and caches the similarity values between common characters.
testing data (folder): Testing data files.
chinese_word_correction_data.json: The original training data. (Provided by Prof. Lei CHEN)
weibo_contents_words.set: The vocabulary file of training data.
weibo_contents_words.bin: The trained binary KenLM language model.
pd_simi_dic.pkl: The similarity cache file; it can be generated by similar_char_preprocessing.py.
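
Before importing Checker, it can help to confirm that the data files are in place. The sketch below assumes that the three data files named in the list above are the ones required and that they sit in the working directory. Both are assumptions; check Checker.py for the paths it actually loads. The helper name `missing_data_files` is hypothetical.

```python
from pathlib import Path

# Assumed list, taken from the Files section; Checker.py is the authority
# on which paths are really loaded.
REQUIRED_FILES = [
    "weibo_contents_words.set",
    "weibo_contents_words.bin",
    "pd_simi_dic.pkl",
]


def missing_data_files(root=".", required=REQUIRED_FILES):
    """Return the required file names that are not present under root."""
    base = Path(root)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All data files found.")
```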

Future Work

Although the performance of our correction system is acceptable, there are some problems which should be solved in the future.

First of all, the correction speed of our design is extremely low. We designed several optimizations to speed up the system; for example, we compute the pairwise similarity values of common characters in advance. However, the number of character combinations is large and many comparisons are needed, so the overall correction speed remains very low. Because of this, we could only test our corrector on a small test set, which may not be convincing.
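
The precomputation idea can be sketched as follows. This is an illustration only: `toy_similarity` is a placeholder metric and the code strings are made up (they are not real Sijiao codes), and the cache layout is an assumption rather than the format of pd_simi_dic.pkl. It also shows why the cache grows quickly: n characters produce n(n-1)/2 pairs.

```python
import pickle
from itertools import combinations


def toy_similarity(a, b, codes):
    """Placeholder metric: fraction of matching positions in two code strings."""
    ca, cb = codes[a], codes[b]
    matches = sum(x == y for x, y in zip(ca, cb))
    return matches / max(len(ca), len(cb))


def build_cache(chars, codes):
    """Compute every unordered pair once and store it under a sorted key."""
    cache = {}
    for a, b in combinations(sorted(chars), 2):
        cache[(a, b)] = toy_similarity(a, b, codes)
    return cache


def lookup(cache, a, b):
    """Dictionary lookup that works for either argument order."""
    if a == b:
        return 1.0
    key = (a, b) if a < b else (b, a)
    return cache[key]


# Made-up codes for illustration only; they are NOT real Sijiao codes.
codes = {"候": "1234", "猴": "1235", "时": "9876"}
cache = build_cache(codes.keys(), codes)

# Round-trip through pickle, standing in for writing and reading a cache file.
restored = pickle.loads(pickle.dumps(cache))
print(len(restored), lookup(restored, "猴", "候"))
```

At correction time each comparison becomes a dictionary lookup, but the number of lookups per sentence is unchanged, which is consistent with the slowness described above.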

The next problem is detection accuracy. If several singleton words appear in a row and are not errors, the detector still treats them as errors, because flagging continuous singletons is how our detection algorithm works.
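
A minimal sketch of this behaviour, under stated assumptions: the segmentations are written by hand (not produced by the real segmenter), a run of two or more single-character tokens counts as suspicious, and punctuation is ignored. The function names and the threshold are hypothetical; the actual rule in Checker.py may differ.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def singleton_runs(tokens, min_run=2):
    """Return (start, end) index pairs of runs of at least min_run consecutive
    single-character CJK tokens; end is exclusive."""
    runs, start = [], None
    for i, tok in enumerate(list(tokens) + [""]):  # sentinel closes a trailing run
        single = len(tok) == 1 and is_cjk(tok)
        if single and start is None:
            start = i
        elif not single and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs


# Hand-written segmentations (assumed for illustration).
typo = ["我", "已经", "等", "猴", "多时", "了", "。"]
clean = ["我", "也", "是", "学生"]

print(singleton_runs(typo))   # [(2, 4)]  the misspelled span
print(singleton_runs(clean))  # [(0, 3)]  a false positive on correct text
```

The second call shows the problem: three legitimate one-character words in a row are flagged exactly like a real error.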

The last problem is the candidate selection algorithm in the correction part. We select candidates by appending each one to the prefix words and querying the language model for a score. However, candidates made of fewer, longer words tend to get higher scores. For example, when correcting the sentence '平民刘备鉴持卟鞋', the prefix words are '平民' and '刘备', and the candidates are '坚持/补鞋' and '坚持不懈'. Although the first candidate is more similar to the input, its score will probably be lower than that of the second one: with the prefix, the first candidate forms a 4-gram and the second a 3-gram, and a 4-gram score is usually lower than a 3-gram score.
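
The length effect can be reproduced with toy numbers. The sketch assumes the sentence score is the sum of per-token log probabilities, as is typical for n-gram models; all values below are made up. The per-token average is shown only as one possible way to compare candidates of different lengths, not as something this project implements.

```python
def total_score(logprobs):
    """Sum of per-token log probabilities (more tokens means more negative terms)."""
    return sum(logprobs)


def per_token_score(logprobs):
    """Average log probability per token, a hypothetical length-normalized score."""
    return sum(logprobs) / len(logprobs)


# Made-up per-token log probabilities, for illustration only.
two_words = [-3.0, -3.5, -2.0, -2.5]  # 平民 刘备 坚持 补鞋   (4 tokens)
one_word = [-3.0, -3.5, -3.25]        # 平民 刘备 坚持不懈   (3 tokens)

for name, lp in [("坚持/补鞋", two_words), ("坚持不懈", one_word)]:
    print(name, total_score(lp), per_token_score(lp))
```

With these numbers the summed score prefers '坚持不懈' (-9.75 against -11.0), while the per-token average prefers '坚持/补鞋' (-2.75 against -3.25), even though each token of the two-word candidate is individually more likely.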

