chinesecorrector's Introduction

#README

Author: Hanwen, LIU - HKUST

A Chinese word correction system with detection and correction functions, based on an n-gram language model and Chinese text segmentation. The detection core focuses on runs of consecutive singletons (single-character words left over after segmentation), while the correction core focuses on the shape and pronunciation similarity of characters.
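The detection idea can be sketched as follows. This is a minimal illustration, not the code in Checker.py: the function name `find_singleton_runs`, the `min_run` threshold, and the token list are assumptions made for the example, and the tokens are written by hand rather than produced by a real segmenter.

```python
from typing import List, Tuple


def find_singleton_runs(tokens: List[str], min_run: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, of runs of at
    least `min_run` consecutive single-character tokens."""
    runs = []
    start = None  # index where the current run of singletons began
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that reaches the end of the sentence.
    if start is not None and len(tokens) - start >= min_run:
        runs.append((start, len(tokens)))
    return runs


# Hand-written segmentation of '我已经等猴多时了' (an assumption, not
# actual segmenter output): the misspelled idiom falls apart into singletons.
tokens = ['我', '已经', '等', '猴', '多', '时', '了']
print(find_singleton_runs(tokens))  # [(2, 7)]
```

The isolated '我' is not flagged because it is a run of length one, while the trailing '了' is swept into the suspicious span. This hints at the false-positive risk of treating every singleton run as an error.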

Usage

In [1]: import Checker

PREPROCESSING DONE!

In [2]: fix = Checker.correct_core('我已经等猴多时了。')

In [3]: for word in fix:
   ...:     print(word)
   ...:

已经
['等候多时', '得道多助', '勇而多计']
In [4]: fix = Checker.correct_core('我已经等猴多时了。')

In [5]: answer = ''

In [6]: for word in fix:
   ...:     if isinstance(word, list):
   ...:         answer += word[0]
   ...:     else:
   ...:         answer += word
   ...:

In [7]: print(answer)
我已经等候多时了

Please make sure that the data files mentioned in Checker.py have been downloaded. Some files are too large to host here; please download them from Google Drive: https://drive.google.com/open?id=1A_rifWNTVLkPeTfKTPN-KaeaqTj09-IG

Files

  • Checker.py: Detection and correction system module.
  • CharSimilarity.py: Characters similarity measurement module.
  • Experimental Results.ipynb: Experimental Results.
  • sijiao_dict.py: Sijiao codes of characters from https://github.com/contr4l/SimilarCharactor.
  • similar_char_preprocessing.py: Script that precomputes and caches the similarity values between common characters.
  • testing data (folder): Testing data files.
  • chinese_word_correction_data.json: The original training data (provided by Prof. Lei CHEN).
  • weibo_contents_words.set: The vocabulary file of training data.
  • weibo_contents_words.bin: The trained binary KenLM language model.
  • pd_simi_dic.pkl: The similarity cache file; it can be generated by similar_char_preprocessing.py.
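Before importing Checker, a quick check that the data files are in place can save a confusing traceback. The sketch below uses file names from the list above. Which of them Checker.py actually loads, and from which directory, is an assumption, and `missing_files` is a hypothetical helper rather than part of the project.

```python
from pathlib import Path
from typing import Iterable, List

# File names come from the list above; whether Checker.py needs all of
# them in the working directory is an assumption.
DATA_FILES = [
    'weibo_contents_words.set',
    'weibo_contents_words.bin',
    'pd_simi_dic.pkl',
]


def missing_files(names: Iterable[str], root: str = '.') -> List[str]:
    """Return the names that do not exist as regular files under `root`."""
    base = Path(root)
    return [name for name in names if not (base / name).is_file()]


if __name__ == '__main__':
    missing = missing_files(DATA_FILES)
    if missing:
        print('Missing data files:', ', '.join(missing))
    else:
        print('All data files found.')
```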

Future Work

Although the performance of our correction system is acceptable, several problems remain to be solved in the future.

First of all, correction is extremely slow. We designed some optimizations to speed up the system; for example, we precompute the pairwise similarity values of common characters. However, the number of character combinations is large and many comparisons are needed, so correction as a whole remains very slow. Because of the low speed, we could only test our corrector on a small test set, which may not be convincing.
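The precomputation mentioned above can be illustrated with a small sketch. It is not the code in similar_char_preprocessing.py: the codes in `TOY_CODES` are made up for the example (they are not real Sijiao codes), `toy_similarity` is a hypothetical stand-in for the project's shape and pronunciation measure, and the cache layout is an assumption.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable

# Made-up 4-digit shape codes for illustration only; these are NOT the
# real Sijiao codes from sijiao_dict.py.
TOY_CODES = {'候': '2723', '猴': '4723', '时': '6400'}


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical measure: fraction of code positions that match."""
    code_a, code_b = TOY_CODES[a], TOY_CODES[b]
    matches = sum(x == y for x, y in zip(code_a, code_b))
    return matches / len(code_a)


def build_cache(chars: Iterable[str],
                similarity: Callable[[str, str], float]) -> Dict[FrozenSet[str], float]:
    """Score every unordered pair once (assumes a symmetric measure)."""
    return {
        frozenset(pair): similarity(*pair)
        for pair in combinations(sorted(set(chars)), 2)
    }


def lookup(cache: Dict[FrozenSet[str], float], a: str, b: str) -> float:
    """Dictionary lookup at correction time instead of recomputing."""
    return 1.0 if a == b else cache[frozenset((a, b))]


cache = build_cache(TOY_CODES, toy_similarity)
print(len(cache))                 # 3 pairs for 3 characters
print(lookup(cache, '猴', '候'))  # 0.75
```

For n characters the cache holds n(n-1)/2 entries, so the table grows quadratically with the size of the character set.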

The next problem is detection accuracy. If several correct single-character words appear in a row, the detector treats them as errors, because the detection algorithm flags every run of consecutive singletons.

The last problem is the candidate selection algorithm in the correction step. We select candidates by combining each one with the prefix words and querying the language model for a score. However, a candidate that is a single longer word tends to score higher than one split into several shorter words. For example, when correcting the sentence '平民刘备鉴持卟鞋', the prefix words are '平民' and '刘备', and the candidates are '坚持/补鞋' and '坚持不懈'. Although the first candidate is more similar to the input, its score is likely to be lower than the second one's, because a 4-gram score is usually lower than a 3-gram score.
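The length effect is easy to see with toy numbers. In the sketch below, the per-word log10 probabilities are made up, not real KenLM output. Dividing by the number of words is one common mitigation shown only for comparison; it is not something the current system does.

```python
from typing import List


def total_score(token_logprobs: List[float]) -> float:
    """Sum of per-word log probabilities: every extra word lowers it."""
    return sum(token_logprobs)


def per_word_score(token_logprobs: List[float]) -> float:
    """Length-normalised score: average log probability per word."""
    return sum(token_logprobs) / len(token_logprobs)


# Made-up log10 probabilities for each word given its context.
two_words = [-2.0, -2.5, -1.5, -2.0]  # 平民 / 刘备 / 坚持 / 补鞋
idiom = [-2.0, -2.5, -3.0]            # 平民 / 刘备 / 坚持不懈

print(total_score(two_words), total_score(idiom))        # -8.0 -7.5
print(per_word_score(two_words), per_word_score(idiom))  # -2.0 -2.5
```

With these numbers the idiom wins on the total score even though each of its words is no more likely, while the per-word average prefers the two-word candidate. Whether normalisation helps on real data would have to be tested.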

Reference

[1] J. L. Peterson, Computer programs for detecting and correcting spelling errors, Communications of the ACM 23 (1980) 676–687.

[2] S. Cucerzan, E. Brill, Spelling correction as an iterative process that exploits the collective knowledge of web users, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

[3] F. Ahmad, G. Kondrak, Learning a spelling error model from search query logs, in: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 955–962.

[4] C.-H. Chang, A new approach for automatic Chinese spelling correction, in: Proceedings of Natural Language Processing Pacific Rim Symposium, volume 95, Citeseer, pp. 278–283.

[5] Y. Zheng, C. Li, M. Sun, Chime: An efficient error-tolerant Chinese pinyin input method, in: IJCAI, volume 11, pp. 2551–2556.

[6] fxsjy, Jieba - github.com, https://github.com/fxsjy/jieba.

[7] K. Heafield, I. Pouzyrevsky, J. H. Clark, P. Koehn, Scalable modified Kneser-Ney language model estimation, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp. 690–696.

[8] K. Heafield, KenLM: faster and smaller language model queries, in: Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, pp. 187–197.

[9] K. Heafield, KenLM homepage, https://kheafield.com/code/kenlm/.

[10] D. China, A Chinese text similarity measurement algorithm based on ssc, https://blog.csdn.net/chndata/article/details/41114771.

[11] contr4l, Similarcharactor, https://github.com/contr4l/SimilarCharactor.
