This repository contains the code accompanying the paper "An End-to-End Chinese Text Normalization Model based on Rule-Guided Flat-Lattice Transformer", published at ICASSP 2022.
Python: 3.7.3
PyTorch: 1.2.0
FastNLP: 0.5.0
Numpy: 1.16.4
For more about FastNLP, please see the FastNLP project documentation.
We release a large-scale Chinese Text Normalization (TN) dataset in cooperation with Databaker (Beijing) Technology Co., Ltd.
To download the dataset, please visit https://www.data-baker.com/en/#/data/index/TNtts.
(For Chinese version of the download page, please visit https://www.data-baker.com/data/index/TNtts.)
The raw dataset, in JSONL format, is saved at:
dataset/cleaned_dataset_by_myself/CN_TN_epoch-01-28645_2.jsonl
Each line of the raw dataset is a single JSON record.
Preprocessed data are saved at:
dataset/cleaned_dataset_by_myself/shuffled_BMES
The preprocessed data are in BMES format.
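BMES is a common character-level tagging scheme (Begin / Middle / End / Single). As an illustration only (the label set and exact conventions here are assumptions, not necessarily those used in this dataset), a sequence of labeled spans can be converted to per-character BMES tags like this:

```python
def to_bmes(segments):
    """Convert (token, label) spans to per-character BMES tags.

    Hypothetical sketch: a single-character span gets S-label; a longer
    span gets B-label, M-label..., E-label across its characters.
    """
    tags = []
    for token, label in segments:
        if len(token) == 1:
            tags.append(f"S-{label}")
        else:
            tags.append(f"B-{label}")
            tags.extend(f"M-{label}" for _ in token[1:-1])
            tags.append(f"E-{label}")
    return tags
```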
We divided the data into train, dev, and test sets with an 8:1:1 ratio.
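An 8:1:1 split like the one above can be sketched as follows (a minimal example with a seeded shuffle; the repository's own script may split differently):

```python
import random

def split_811(samples, seed=42):
    """Shuffle and split samples into train/dev/test at an 8:1:1 ratio."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    data = list(samples)
    rng.shuffle(data)
    n_train = int(len(data) * 0.8)
    n_dev = int(len(data) * 0.1)
    train = data[:n_train]
    dev = data[n_train:n_train + n_dev]
    test = data[n_train + n_dev:]  # remainder goes to test
    return train, dev, test
```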
You can also run our code to preprocess and re-split the raw dataset:
python /dataset/cleaned_dataset_by_myself/get_json.py
You can run the following script to count the examples in each NSW (non-standard word) category:
python /dataset/cleaned_dataset_by_myself/sep_data.py
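For instance, category counts over BMES-style tags can be gathered with a `collections.Counter` (a sketch assuming tags of the form `B-LABEL` / `S-LABEL`; the script's actual logic may differ):

```python
from collections import Counter

def count_categories(tags):
    """Count NSW categories, counting each span once at its B-/S- tag."""
    counts = Counter()
    for tag in tags:
        if tag.startswith(("B-", "S-")):  # span starts only, skip M-/E-
            counts[tag.split("-", 1)[1]] += 1
    return counts
```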
Our code is in the V1 directory. To run the training code:
python /V1/flat_main.py --dataset databaker
Our proposed rule base is saved in a Python file:
/V1/add_rule.py