
Planning-based Poetry Generation

A classical Chinese quatrain generator based on the RNN encoder-decoder framework.

Here I tried to implement the planning-based architecture proposed in Wang et al. 2016, though the technical details may differ from the original paper. My goal was not to refine the neural network model or produce better results myself. Rather, I wanted to provide a simple framework, as described in the paper, along with convenient data-processing toolkits for anyone who wants to experiment with their own ideas on this interesting task.

As of June 2018, this project has been refactored into Python 3 using TensorFlow 1.8.

Code Organization

Structure of Code

The diagram above illustrates the major dependencies in this codebase, in terms of both data and functionality. I tried to organize the code around data, making every data-processing module a singleton at runtime. Batch processing runs only when the produced result is either missing or outdated.
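The "regenerate only when missing or outdated" behavior can be sketched like this. The function names and path parameters here are illustrative, not the repo's actual API:

```python
# Sketch of the "regenerate only when stale" pattern described above.
# Names (build_fn, load_fn, the path arguments) are illustrative.
import os

def is_stale(product_path, source_paths):
    """A product file is stale if it is missing or older than any source."""
    if not os.path.exists(product_path):
        return True
    product_mtime = os.path.getmtime(product_path)
    return any(os.path.getmtime(src) > product_mtime for src in source_paths)

_cache = None  # module-level singleton: parsed at most once per run

def get_data(product_path, source_paths, build_fn, load_fn):
    """Return cached data, rebuilding the product file only when stale."""
    global _cache
    if _cache is None:
        if is_stale(product_path, source_paths):
            build_fn(product_path)   # batch processing happens here
        _cache = load_fn(product_path)
    return _cache
```

Each processing module then exposes one such getter, so downstream modules never have to know whether the data was freshly built or loaded from disk.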

Dependencies

Python 3 and TensorFlow 1.8; the data pipeline also relies on the Jieba segmenter, and the planner uses Gensim's Word2Vec.

Data Processing

Run the following command to generate training data from source text data:

./data_utils.py

Depending on your hardware, this can take anywhere from a few minutes to over an hour. The keyword extraction is based on the TextRank algorithm, which can take a long time to converge.
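For readers unfamiliar with TextRank, here is a minimal stdlib sketch of the idea: build a word co-occurrence graph, then run PageRank-style updates on it until the scores settle. The repo's actual extraction lives in the data pipeline and may differ in windowing, weighting, and convergence criteria:

```python
# Minimal TextRank-style keyword ranker over tokenized sentences.
# This is a sketch of the algorithm, not the repo's implementation.
from collections import defaultdict

def textrank(sentences, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    # Build an undirected co-occurrence graph within a sliding window.
    neighbors = defaultdict(set)
    for words in sentences:
        for i, w in enumerate(words):
            for v in words[max(0, i - window): i + window + 1]:
                if v != w:
                    neighbors[w].add(v)
                    neighbors[v].add(w)
    # Iterate the PageRank update; a fixed iteration count stands in
    # for a proper convergence test.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)
```

Words that co-occur with many well-connected words end up ranked highest, which is why the iteration (and hence the long runtime noted above) is unavoidable.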

Training

The poem planner is based on Gensim's Word2Vec module. To train it, simply run:

./train.py -p
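For intuition, here is a stdlib-only sketch of the keyword-expansion idea behind the planner: given the user's keywords, pull in their nearest neighbors in an embedding space until there is one keyword per line. The toy vectors and helper names (expand_keywords, cosine) are illustrative; the actual planner trains its embeddings with Gensim's Word2Vec on the poem corpus:

```python
# Sketch of planner-style keyword expansion by embedding similarity.
# The vectors and function names are illustrative, not the repo's API.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_keywords(seeds, embeddings, n_total=4):
    """Greedily add the words most similar to any seed until n_total keywords."""
    keywords = list(seeds)
    candidates = [w for w in embeddings if w not in keywords]
    while len(keywords) < n_total and candidates:
        best = max(candidates,
                   key=lambda w: max(cosine(embeddings[w], embeddings[s])
                                     for s in seeds))
        keywords.append(best)
        candidates.remove(best)
    return keywords
```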

The poem generator is implemented as an encoder-decoder model with an attention mechanism. To train it, run:

./train.py -g
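As a refresher on the attention mechanism mentioned above, here is a pure-Python sketch of dot-product attention: score each encoder state against the current decoder state, softmax the scores, and take the weighted average as the context vector. The repo's model does this inside TensorFlow, and its exact scoring function may differ:

```python
# Pure-Python sketch of dot-product attention; illustrative only.
import math

def attention(decoder_state, encoder_states):
    """Return (weights, context): softmax(scores) over the encoder states."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    # Numerically stable softmax over the scores.
    exps = [math.exp(s - max(scores)) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Context vector: attention-weighted average of the encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(len(encoder_states[0]))]
    return weights, context
```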

You can also train both models together by running:

./train.py -a

To erase all trained models, run:

./train.py --clean

As it turned out, the attention-based generator was hard to train well after the refactor. On my side, the average loss typically gets stuck at ~5.6 and won't go down any further, so there should be considerable room for improvement.

Run Tests

Type the following command:

./main.py

Then, each time you type in a hint text in Chinese, it should return a (for now, rather gibberish) poem. It's up to you to decide how to improve the models and training methods to make them work better.

Improve It

  • To add data-processing tools, consider adding dependency configs to __dependency_dict in paths.py. This lets processed data be regenerated automatically when it goes stale.

  • To improve the planning model, please refine the planner class in plan.py.

  • To improve the generation model, please refine the generator class in generate.py.

Contributors

devinz1993, rular099


Issues

_get_rhyme function cannot handle 'N' or 'G'

The current implementation cannot handle words ending in 'N' or 'G':
the _get_vowel function returns "" when the pinyin ends with 'IANG', so _get_rhyme can never return 10.

A possible fix:

def _get_vowel(pinyin):
    i = len(pinyin) - 1
    # Skip a trailing nasal ending ('N' or 'NG') before scanning for vowels.
    if pinyin.endswith('N'):
        i -= 1
    if pinyin.endswith('NG'):
        i -= 2
    # Walk left over the vowel letters.
    while i >= 0 and \
            pinyin[i] in ['A', 'O', 'E', 'I', 'U', 'V']:
        i -= 1
    # Return the vowel cluster plus any nasal ending.
    return pinyin[i + 1:]

What's the source of qtais.txt?

I know the meaning of the following corpus files:

qts: 全唐诗 (Complete Tang Poems)
qss: 全宋诗 (Complete Song Poems)
qsc: 全宋词 (Complete Song Lyrics)
yuan: 元朝诗歌 (Yuan-dynasty poetry)
ming: 明代诗歌 (Ming-dynasty poetry)
qing: 清代诗歌 (Qing-dynasty poetry)

Am I correct? And what does qtais.txt stand for?

Thanks!

Where does the source data come from?

Hi Devin,
Thank you for your amazing work; I've learned a lot from it.
I'm just not sure where the raw data comes from. I think all the poem data might come from Zhang and Lapata (EMNLP 2014). What about the pinyin dictionary?

Why doesn't the generated poem contain the keywords?

Input Text: 春天桃花开了 (the peach blossoms have bloomed in spring)
Keywords: 桃花开 相宜 春天 江边
Poem Generated:
梅花香满绿荫时，石木无春午六滨。
江水东归灯下望，江南车径与谁槟。
In the paper "Chinese Poetry Generation with Planning based Neural Network", it seems that the generated poems do contain the keywords.

Thank you for your contribution, and two requests

Hi author,
This is a good job; thank you for sharing! I have two requests:
Can you explain the process of keyword extraction, keyword expansion, and generation based on keywords?
Also, do you have the code of Zhe Wang et al., "Chinese Poetry Generation with Planning based Neural Network", 2016?
Looking forward to your reply. Thanks!

local variable 'data' referenced before assignment

Hello,
When I run $ python data_utils.py, it gives an error message like this:

File "corpus.py", line 63, in get_all_corpus
corpus.extend(data)
UnboundLocalError: local variable 'data' referenced before assignment

My python version is 2.7.10.
I don't know where the problem is.

data_utils.py failed with error

Hi, I was trying out the code after reading the paper. It seems like one of the data processing steps failed: after running data_utils.py, there was only one output (sxhy_dict.txt) in the data folder, and it gave the following error:

(tensorflow2) Vera-MacBook-Pro:Chinese-Poetry-Generation Vera$ python data_utils.py

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/rj/yxfk4xl915l_0drx9d_stdgh0000gn/T/jieba.cache
Loading model cost 1.845 seconds.
Prefix dict has been built succesfully.
Generating the vocabulary ...
Parsing raw/qts_tab.txt ...
Traceback (most recent call last):
  File "data_utils.py", line 127, in <module>
    train_data = get_train_data()
  File "data_utils.py", line 72, in get_train_data
    _gen_train_data()
  File "data_utils.py", line 34, in _gen_train_data
    poems = get_pop_quatrains()
  File "/Users/Vera/Documents/projects/ura/Chinese-Poetry-Generation/cnt_words.py", line 40, in get_pop_quatrains
    cnts = get_word_cnts()
  File "/Users/Vera/Documents/projects/ura/Chinese-Poetry-Generation/cnt_words.py", line 27, in get_word_cnts
    _gen_word_cnts()
  File "/Users/Vera/Documents/projects/ura/Chinese-Poetry-Generation/cnt_words.py", line 14, in _gen_word_cnts
    quatrains = get_quatrains()
  File "/Users/Vera/Documents/projects/ura/Chinese-Poetry-Generation/quatrains.py", line 21, in get_quatrains
    _, ch2int = get_vocab()
  File "/Users/Vera/Documents/projects/ura/Chinese-Poetry-Generation/vocab.py", line 31, in get_vocab
    _gen_vocab()
  File "/Users/Vera/Documents/projects/ura/Chinese-Poetry-Generation/vocab.py", line 15, in _gen_vocab
    corpus = get_all_corpus()
  File "/Users/Vera/Documents/projects/ura/Chinese-Poetry-Generation/corpus.py", line 63, in get_all_corpus
    corpus.extend(data)
UnboundLocalError: local variable 'data' referenced before assignment

Below is the list of installed packages:

# packages in environment at /anaconda/envs/tensorflow2:
#
boto                      2.46.1                    <pip>
bz2file                   0.98                      <pip>
funcsigs                  1.0.2                     <pip>
gensim                    2.1.0                     <pip>
jieba                     0.38                      <pip>
mock                      2.0.0                     <pip>
numpy                     1.12.1                    <pip>
openssl                   1.0.2k                        2  
pbr                       3.0.1                     <pip>
pip                       9.0.1                    py27_1  
pip                       9.0.1                     <pip>
protobuf                  3.3.0                     <pip>
python                    2.7.13                        0  
readline                  6.2                           2  
requests                  2.14.2                    <pip>
scipy                     0.19.0                    <pip>
setuptools                27.2.0                   py27_0  
six                       1.10.0                    <pip>
smart-open                1.5.3                     <pip>
sqlite                    3.13.0                        0  
tensorflow                1.1.0                     <pip>
tk                        8.5.18                        0  
Werkzeug                  0.12.2                    <pip>
wheel                     0.29.0                   py27_0  
zlib                      1.2.8                         3

Any suggestions?

Why do you sample using a random function?

Hi Nan,
I'm pretty confused about one thing. On generate.py line 204, why do you use random.random() < prob_list[j]/prob_sums[j] to choose the character? I've tried the argmax method, but it always generates frequently occurring characters like "不" and "一".
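For context: argmax always returns the single most probable character, which collapses onto very frequent ones, while sampling in proportion to the probabilities preserves diversity. Here is a stdlib sketch of the same idea as the random.random() < prob_list[j]/prob_sums[j] trick; the function name and signature are illustrative, not the repo's API:

```python
# Weighted sampling sketch: draw a character with probability
# proportional to its score, instead of always taking the argmax.
import random

def sample(chars, probs, rng=random.Random(0)):
    """Draw one character with probability proportional to its score."""
    total = sum(probs)
    r = rng.random() * total
    acc = 0.0
    for ch, p in zip(chars, probs):
        acc += p
        if r <= acc:
            return ch
    return chars[-1]  # guard against floating-point round-off
```

Over many draws, common characters still dominate, but rarer ones keep appearing, which is what keeps the generated lines from repeating the same few characters.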

What's the difference between the two word2vec models?

I find there are two word2vec models:
one in plan.py, where we train a model called kw_model.bin;
another in word2vec.py, where we train a model called word2vec.npy.
I think both take the quatrains as input, so what's the difference between them?
Thanks!
