richard-peng-xia / chinese-noisy-text

This repository contains the code for a data augmentation method that works at the Chinese word and character levels. It adds four kinds of noise to words and characters: redundancy, missing, selection, and ordering errors.

License: MIT License

Chinese-Noisy-Text

Requirements & Datasets

pip install -r requirements.txt

Due to copyright restrictions, the two data folders (ChineseHomophones and SimilarCharacter) have not been uploaded to this repository. If you need them, you can download them from the two links below.

ChineseHomophones | SimilarCharacter

Usage

I've implemented the 5 noise functions described in the paper:

  1. Delete words with a given probability (default is 0.163)
  2. Replace words with similar words or homophones with a given probability (default is 0.163)
  3. Swap words within a certain range (default range is 0.163)
  4. Repeat words with a given probability (default is 0.163)
  5. Select any of the above noise functions at random
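
The five functions listed above can be illustrated with a minimal Python sketch. This is not the repository's code: the function names, the `confusion` dictionary (a stand-in for the homophone and similar-character resources), and the adjacent-only swap are all assumptions made for illustration, and the toy entries are invented.

```python
import random

def delete_words(words, p=0.163, rng=random):
    """Missing noise: drop each word with probability p (always keep one word)."""
    kept = [w for w in words if rng.random() >= p]
    return kept or words[:1]

def replace_words(words, confusion, p=0.163, rng=random):
    """Selection noise: replace a word with a similar word or homophone."""
    out = []
    for w in words:
        candidates = confusion.get(w)
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(w)
    return out

def swap_words(words, p=0.163, rng=random):
    """Ordering noise (simplified): swap a word with its right neighbour."""
    out = list(words)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return out

def repeat_words(words, p=0.163, rng=random):
    """Redundant noise: duplicate each word with probability p."""
    out = []
    for w in words:
        out.append(w)
        if rng.random() < p:
            out.append(w)
    return out

def random_noise(words, confusion, p=0.163, rng=random):
    """Pick one of the four noise functions above at random and apply it."""
    funcs = [
        lambda ws: delete_words(ws, p, rng),
        lambda ws: replace_words(ws, confusion, p, rng),
        lambda ws: swap_words(ws, p, rng),
        lambda ws: repeat_words(ws, p, rng),
    ]
    return rng.choice(funcs)(words)

# Toy, hypothetical data: a segmented sentence and a tiny confusion dictionary.
words = ["我", "喜欢", "学习", "中文"]
confusion = {"喜欢": ["喜观"], "学习": ["学系"]}
print(random_noise(words, confusion, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when comparing noised corpora across experiments.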

I set the error rate of each noising pass to 16.3%, which keeps the expected error rate of the corpus at about 30% after two passes of noise (calculated from the mathematical expectation).
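
The arithmetic behind these two numbers can be checked with a short sketch. It assumes the two passes are independent and that each token is corrupted with the same probability in each pass, which is one plausible reading of the expectation calculation; the function names are hypothetical.

```python
import math

def combined_error_rate(p, passes=2):
    """Probability that a token is corrupted at least once in `passes` passes."""
    return 1.0 - (1.0 - p) ** passes

def per_pass_rate(target, passes=2):
    """Per-pass rate needed to reach `target` after `passes` independent passes."""
    return 1.0 - math.pow(1.0 - target, 1.0 / passes)

print(f"{combined_error_rate(0.163):.4f}")  # 0.2994
print(f"{per_pass_rate(0.30):.4f}")         # 0.1633
```

Under these assumptions, 16.3% per pass gives roughly 29.9% after two passes, consistent with the 30% target.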

Example of simple usage

bash noise.sh

Example of complete usage

python add_noise/add_noise.py --input input_file_path --redundant_probability 0.2 --selection_probability 0.2 --missing_probability 0.2 --ordering_probability 0.2 --comprehensive_error_probability 0.2
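
As a rough illustration of how flags like those in the command above are commonly declared, here is an `argparse` sketch. The flag names are copied from the command; the help strings, the `required` setting, and the 0.163 defaults are assumptions, and the actual `add_noise.py` may be organized differently.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags shown in the complete-usage command."""
    parser = argparse.ArgumentParser(description="Add noise to Chinese text (sketch).")
    parser.add_argument("--input", required=True, help="path to the input text file")
    for name in ("redundant", "selection", "missing", "ordering", "comprehensive_error"):
        parser.add_argument(
            f"--{name}_probability",
            type=float,
            default=0.163,  # assumed default, taken from the list of noise functions
            help=f"probability used for {name} noise",
        )
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--input", "input_file_path", "--redundant_probability", "0.2"]
)
print(args.input, args.redundant_probability, args.missing_probability)
```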

Results

Figure: Examples of noisy texts

I've run Chinese grammatical error correction experiments on the Chinese Wikipedia corpus, using all available parallel data.

I added noise to the corpus using this repo, which gave the following results on the NLPCC 2018 test set. All results are $F_{0.5}$ scores.

The table below reports a Transformer model identical to the "base model" in Vaswani et al. (2017).

Model            $F_{0.5}$
baseline         33.17
baseline+noise   35.55

Table: Transformer base model
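
For readers unfamiliar with the metric, the $F_{0.5}$ score used for these results is the standard $F_\beta$ measure with $\beta = 0.5$, which weights precision $P$ more heavily than recall $R$:

```latex
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}
```

On this measure, adding noise raises the score by 35.55 - 33.17 = 2.38 points over the baseline.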

Reference

@article{xia2022chinese,
  title={Chinese grammatical error correction based on knowledge distillation},
  author={Xia, Peng and Zhou, Yuechi and Zhang, Ziyan and Tang, Zecheng and Li, Juntao},
  journal={arXiv preprint arXiv:2208.00351},
  year={2022}
}

@misc{Xia2022ChineseNoisyText,
  author = {Peng Xia},
  title = {Chinese-Noisy-Text},
  year = {2022},
  version = {doi},
  doi = {10.5281/zenodo.7025129},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Richard88888/Chinese-Noisy-Text}}
}

@inproceedings{tang2021基于字词粒度噪声数据增强的中文语法纠错,
  title={基于字词粒度噪声数据增强的中文语法纠错 (Chinese Grammatical Error Correction enhanced by Data Augmentation from Word and Character Levels)},
  author={Tang, Zecheng and Ji, Yixin and Zhao, Yibo and Li, Junhui},
  booktitle={Proceedings of the 20th Chinese National Conference on Computational Linguistics},
  pages={813--824},
  year={2021}
}

Notes

Do not hesitate to contact me if you need help, want a feature, or find a bug.

Contributions are welcome.
