ctatool's Introduction

Chinese Text Augmentation Tool

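A revision of the EDA (Easy Data Augmentation) code at https://github.com/jasonwei20/eda_nlp/tree/master/code, adapted for Chinese text.

Refer to that repository for the meaning of the hyperparameters listed after --output in the usage command (num_aug and the alpha values).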

Internal Structure:

  • main.py: reads the input file and writes the output file; feel free to customize it.
  • functions.py: the four operations synonym_replacement, random_insertion, random_deletion, and random_swap, plus the gen_eda function that applies all of them according to the alpha parameters.

    Warning: synonym_replacement is a coarse first attempt and may need heavy revision if you want more refined results.

  • cache.py: caches the Chinese Wordnet lemmas and their synonym lists for later use, because the lemma search is O(n).
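
The caching idea behind cache.py can be sketched as follows. This is a minimal illustration, not the repository's code: `build_synonym_cache`, `load_or_build`, and the toy `records` list are hypothetical names, and the real script reads its lemmas from CwnGraph rather than from an in-memory list. The point is that a single pass over all lemmas builds a dict, after which each lookup is a hash access on average instead of an O(n) scan.

```python
import os
import pickle
import tempfile


def build_synonym_cache(lemma_records):
    """One O(n) pass over (lemma, synonyms) pairs, merging duplicate lemmas."""
    cache = {}
    for lemma, synonyms in lemma_records:
        bucket = cache.setdefault(lemma, [])
        for syn in synonyms:
            # Skip the lemma itself and synonyms that were already recorded.
            if syn != lemma and syn not in bucket:
                bucket.append(syn)
    return cache


def load_or_build(path, lemma_records):
    """Load the pickled cache if it exists; otherwise build it and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    cache = build_synonym_cache(lemma_records)
    with open(path, "wb") as f:
        pickle.dump(cache, f)
    return cache


# Toy stand-in for entries that would come from Chinese Wordnet (hypothetical data).
records = [
    ("快樂", ["高興", "開心"]),
    ("快樂", ["開心", "愉快"]),
    ("高興", ["快樂"]),
]
cache_path = os.path.join(tempfile.mkdtemp(), "synonyms.pkl")
cache = load_or_build(cache_path, records)   # first call builds and saves
synonyms = cache.get("快樂", [])              # later lookups are dict accesses
```

A second call to `load_or_build` with the same path skips the build step and reads the pickle, which is the saving the cache is meant to provide across runs.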

1. For first use: download the necessary data

Chinese Wordnet (CWN):

https://github.com/lopentu/CwnGraph

      #(shell) 
      git clone https://github.com/lopentu/CwnGraph
      #(colab) 
      !git clone https://github.com/lopentu/CwnGraph

Ckiptagger:

https://github.com/ckiplab/ckiptagger (its model data is the directory passed to main.py as --ckipdata)

      from ckiptagger import data_utils
      data_utils.download_data_gdown("./")  # downloads and extracts the model data under ./

2. Usage:

Default settings:

 python3 main.py --input=./aicup_dataset/Train_qa_ans_.json \
     --ckipdata=./ckipdata \
     --cwngit=./CwnGraph \
     --cwn_py=./cwn_graph.pyobj \
     --output=./out.json \
     --num_aug=5 \
     --alpha_sr=0.1 \
     --alpha_ri=0.1 \
     --alpha_rs=0.1 \
     --alpha_rd=0.1 \
     --seed=0 \
     --save_synonyms=0

  • --num_aug: number of augmented copies per input; 5 gives 5x augmented + 1x original
  • --alpha_sr: synonym_replacement
  • --alpha_ri: random insertion (with synonyms)
  • --alpha_rs: random swap
  • --alpha_rd: random deletion
  • --seed: random seed; the author recommends 1126
  • --save_synonyms: set to 1 to output a dictionary of the synonyms that were searched or used
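
How the alpha values translate into edits can be sketched as below. The sketch follows the convention of the upstream EDA code, where a sentence of n tokens receives max(1, int(alpha * n)) swaps and alpha_rd acts as a per-token deletion probability; whether functions.py in this repository keeps exactly those formulas is an assumption. The token list is a hand-segmented example (the tool itself segments text with ckiptagger), and the function names are illustrative only.

```python
import random


def num_ops(alpha, num_tokens):
    """Edits per sentence: at least one, growing with sentence length."""
    return max(1, int(alpha * num_tokens))


def random_swap(tokens, n, rng):
    """Swap two randomly chosen distinct positions, n times, on a copy."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, but never return an empty list."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed, mirroring --seed=0
tokens = ["我", "今天", "想", "去", "圖書館", "讀", "書", "然後", "回家", "吃", "晚餐"]

n_rs = num_ops(0.1, len(tokens))             # 11 tokens at alpha_rs=0.1 -> 1 swap
swapped = random_swap(tokens, n_rs, rng)     # same tokens, two positions exchanged
deleted = random_deletion(tokens, 0.1, rng)  # each token kept with probability 0.9
```

Because of the max(1, ...) floor, short sentences still receive one edit even when alpha times the length rounds down to zero, so small alpha values mainly matter for longer inputs.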
 

ctatool's People

Contributors

nana2929
