gmftbygmftby / multiturndialogzoo

Multi-turn dialogue baselines written in PyTorch

License: MIT License

Python 76.17% Shell 3.42% Jupyter Notebook 19.37% Perl 1.03%

hred wseq seq2seq recosa gcn gat hran hred-attn transformer dshred

multiturndialogzoo's Introduction

Multi-turn Dialog Zoo

A batch of ready-to-use multi-turn or single-turn dialogue baselines.

PRs and issues are welcome.

TODO

Memory Network
HVMN
Pure Transformer (in development, poor performance)
GAN-based multi-turn dialogue generation
RL-based fine-tuning dialogue models
Fix the architecture of the decoder (add the context vector $c$ and last token embedding $y_{t-1}$ for predicting $y_t$)

Dataset

The preprocessing scripts for these datasets are in the data/data_process folder.

DailyDialog dataset
Ubuntu corpus
EmpChat
DSTC7-AVSD
PersonaChat

Metric

PPL: test perplexity
BLEU(1-4): nlg-eval version or multi-bleu.perl or nltk
ROUGE-2
Embedding-based metrics: Average, Extrema, Greedy (slow and optional)
Distinct-1/2
BERTScore
BERT-RUBER
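
As an illustration of the Distinct-1/2 metric listed above, here is a minimal sketch that computes the ratio of unique n-grams to total n-grams over a set of tokenized responses. The function name `distinct_n` and the corpus-level definition are assumptions made for this example; the repository's own evaluation code may compute the metric differently (for instance, per sentence).

```python
from collections import Counter
from typing import Iterable, List


def distinct_n(sentences: Iterable[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over all sentences."""
    ngrams = Counter()
    for tokens in sentences:
        # Slide a window of size n over the tokens of one response.
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    # Guard against empty input or sentences shorter than n.
    return len(ngrams) / total if total else 0.0


if __name__ == "__main__":
    responses = [s.split() for s in ["i do not know", "i do not think so"]]
    print("Distinct-1:", distinct_n(responses, 1))
    print("Distinct-2:", distinct_n(responses, 2))
```

A higher value means less repetition across generated responses, which is why the metric is usually reported next to the same statistic for the reference responses.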

Requirements

PyTorch 1.2+ (Transformer support & pack_padded update)
Python 3.6.1+
tqdm
numpy
nltk 3.4+
scipy
sklearn (optional)
rouge
GoogleNews word2vec or GloVe 300 word2vec (optional)
pytorch_geometric (PyG 1.2) (optional)
cuda 9.2 (match with PyG) (optional)
tensorboard (for PyTorch 1.2+)
perl (for running the multi-bleu.perl script)

Dataset format

Three multi-turn open-domain dialogue datasets (DailyDialog, DSTC7_AVSD, PersonaChat) can be obtained from this link

Each dataset contains 6 files:

src-train.txt
tgt-train.txt
src-dev.txt
tgt-dev.txt
src-test.txt
tgt-test.txt

In every file, each line contains exactly one dialogue context (src) or one dialogue response (tgt). To build the graph, each sentence must begin with one of the special tokens <user0> or <user1>, which denote the speaker. The __eou__ token separates the sentences within a conversation context. More details can be found in the example files (the small data case).
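
To make the format concrete, the sketch below splits one src line into (speaker, tokens) pairs. It relies only on the conventions stated above (the `__eou__` separator and the `<user0>`/`<user1>` prefixes); the function name `parse_context` and the sample line are hypothetical and not taken from the repository.

```python
from typing import List, Tuple

EOU = "__eou__"
SPEAKERS = ("<user0>", "<user1>")


def parse_context(line: str) -> List[Tuple[str, List[str]]]:
    """Split one src line into (speaker, tokens) pairs, one per utterance."""
    turns = []
    for segment in line.strip().split(EOU):
        tokens = segment.split()
        if not tokens:
            continue  # tolerate a trailing separator or doubled separators
        if tokens[0] not in SPEAKERS:
            raise ValueError(
                f"utterance does not start with a speaker token: {segment!r}"
            )
        turns.append((tokens[0], tokens[1:]))
    return turns


if __name__ == "__main__":
    # Hypothetical example line, not copied from any dataset file.
    sample = "<user0> hello there __eou__ <user1> hi , how are you ?"
    for speaker, tokens in parse_context(sample):
        print(speaker, tokens)
```

Running a check like this over src-train.txt before generating the graph is a cheap way to catch lines that are missing a speaker token.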

How to use

Model names: Seq2Seq, SeqSeq_MHA, HRED, HRED_RA, VHRED, WSeq, WSeq_RA, DSHRED, DSHRED_RA, HRAN, MReCoSa, MReCoSa_RA
Dataset names: dailydialog, ubuntu, dstc7, personachat, empchat

0. Ready

Before running the following commands, make sure the essential folders are created:

mkdir -p processed/$DATASET
mkdir -p data/$DATASET
mkdir -p tblogs/$DATASET
mkdir -p ckpt/$DATASET

The DATASET variable holds the name of the dataset that you want to process.

1. Generate the vocab of the dataset

# default 25000 words
./run.sh vocab <dataset>
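
For orientation, a frequency-capped vocabulary of the kind this step produces can be sketched as follows. This is not the repository's implementation: the special tokens in `SPECIALS`, the function name `build_vocab`, and the choice to count the 25000 limit without the special tokens are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical special tokens; the repository's actual set may differ.
SPECIALS = ["<pad>", "<unk>", "<sos>", "<eos>"]


def build_vocab(lines: Iterable[str], max_size: int = 25000) -> Dict[str, int]:
    """Map the max_size most frequent whitespace tokens to integer ids."""
    counts = Counter(tok for line in lines for tok in line.split())
    for tok in SPECIALS:
        counts.pop(tok, None)  # special tokens always get the first ids
    words = [word for word, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(SPECIALS + words)}


if __name__ == "__main__":
    vocab = build_vocab(["a b a", "c a b"], max_size=2)
    print(vocab)
```

Tokens that fall outside the cap would be mapped to the unknown token at encoding time.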

2. Generate the graph of the dataset (optional)

# only MTGCN and GatedGCN need to create the graph
# zh or en
./run.sh graph <dataset> <zh/en> <cuda>

3. Check the information about the preprocessed dataset

Show statistics such as the utterance lengths and the number of turns in each multi-turn dialogue.

./run.sh stat <dataset>

4. Train N-gram LM (discarded)

Train an N-gram language model with NLTK (Lidstone smoothing with gamma 0.5; the default n is 3):

# train the N-gram language model with NLTK
./run.sh lm <dataset>
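
Lidstone smoothing adds a constant gamma to every n-gram count, so the estimate is P(w | ctx) = (count(ctx, w) + gamma) / (count(ctx) + gamma * |V|). NLTK provides this estimator as nltk.lm.Lidstone; the sketch below reimplements it with the standard library only, so that the arithmetic is visible. The class name `LidstoneLM`, the padding symbols, and the toy corpus are assumptions for this example and do not reflect the repository's code.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence

BOS, EOS = "<s>", "</s>"  # hypothetical padding symbols


class LidstoneLM:
    def __init__(self, order: int = 3, gamma: float = 0.5):
        self.order, self.gamma = order, gamma
        self.ngrams = Counter()    # counts of (context..., word)
        self.contexts = Counter()  # counts of (context...)
        self.vocab = set()

    def _pad(self, tokens: Sequence[str]) -> List[str]:
        return [BOS] * (self.order - 1) + list(tokens) + [EOS]

    def fit(self, sentences: Iterable[Sequence[str]]) -> None:
        for tokens in sentences:
            padded = self._pad(tokens)
            self.vocab.update(padded)
            for i in range(self.order - 1, len(padded)):
                ctx = tuple(padded[i - self.order + 1:i])
                self.ngrams[ctx + (padded[i],)] += 1
                self.contexts[ctx] += 1

    def prob(self, word: str, context: Sequence[str]) -> float:
        """P(word | context); context must hold order - 1 tokens."""
        ctx = tuple(context)
        numerator = self.ngrams[ctx + (word,)] + self.gamma
        denominator = self.contexts[ctx] + self.gamma * len(self.vocab)
        return numerator / denominator

    def perplexity(self, tokens: Sequence[str]) -> float:
        padded = self._pad(tokens)
        log_sum, n = 0.0, 0
        for i in range(self.order - 1, len(padded)):
            ctx = padded[i - self.order + 1:i]
            log_sum += math.log(self.prob(padded[i], ctx))
            n += 1
        return math.exp(-log_sum / n)


if __name__ == "__main__":
    lm = LidstoneLM(order=3, gamma=0.5)
    lm.fit([["a", "b"], ["a", "c"]])  # toy corpus
    print(lm.prob("a", (BOS, BOS)), lm.perplexity(["a", "b"]))
```

Because gamma is added to every count, unseen n-grams receive a small non-zero probability and the perplexity stays finite for any sentence built from known words.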

5. Train the model on the corresponding dataset

./run.sh train <dataset> <model> <cuda>

6. Translate the test dataset:

# translate mode; for example, dataset dailydialog, model HRED, on the 4th GPU
./run.sh translate <dataset> <model> <cuda>

Translate a batch of models

# edit the script to list the models and datasets you want to translate
./run_batch_translate.sh <cuda>

7. Evaluate the result of the translated utterances

# get the BLEU and Distinct results of the generated sentences on the 4th GPU (BERTScore needs the GPU)
./run.sh eval <dataset> <model> <cuda>

Evaluate a batch of models

# the results are written to the file `./processed/<dataset>/<model>/final_result.txt`
./run_batch_eval.sh <cuda>

8. Get the curve of all the training checkpoints (discarded; tensorboard is all you need)

# draw the performance curve; in practice, all of this information is available in tensorboard
./run.sh curve <dataset> <model> <cuda>

9. Perturb the source test dataset

Refer to the paper: Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study

# 10 perturbation modes
./run.sh perturbation <dataset> <zh/en>
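
The referenced paper perturbs the dialogue history at the utterance level (for example shuffling, reversing, or truncating the utterances) and at the word level (for example shuffling or dropping words). The sketch below shows four such perturbations on a tokenized context. It is a partial illustration, not the repository's script: the function names and the `rate` and `keep_last` parameters are assumptions, and the mapping to the 10 modes of `run.sh` is not specified here.

```python
import random
from typing import List

Utterance = List[str]


def shuffle_utterances(context: List[Utterance], rng: random.Random) -> List[Utterance]:
    """Utterance level: randomly reorder the utterances."""
    out = list(context)
    rng.shuffle(out)
    return out


def reverse_utterances(context: List[Utterance]) -> List[Utterance]:
    """Utterance level: reverse the order of the utterances."""
    return list(reversed(context))


def truncate_context(context: List[Utterance], keep_last: int = 1) -> List[Utterance]:
    """Utterance level: keep only the last keep_last utterances."""
    return list(context[max(0, len(context) - keep_last):])


def drop_words(context: List[Utterance], rate: float, rng: random.Random) -> List[Utterance]:
    """Word level: drop each word independently with probability rate."""
    return [[w for w in utt if rng.random() >= rate] for utt in context]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbed test set is reproducible
    ctx = [["how", "are", "you"], ["fine"], ["good", "to", "hear"]]
    print(shuffle_utterances(ctx, rng))
    print(reverse_utterances(ctx))
    print(truncate_context(ctx, keep_last=2))
    print(drop_words(ctx, 0.3, rng))
```

Comparing test perplexity on the original and perturbed contexts indicates how much a model actually relies on the conversation history.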

Ready-to-use Models

Seq2Seq-attn: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Seq2Seq-MHA: Attention is All you Need. Note that it is very hard for a vanilla Transformer to reach good performance on these datasets. To keep performance stable, I add multi-head self-attention (1 layer; you can change it) to the RNN-based Seq2Seq-attn, which performs better.
HRED: Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. HRED enhanced with utterance-level attention.
HRED-WA: HRED with word-level attention.
WSeq: How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models
WSeq-WA: WSeq with word-level attention.
VHRED: A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues, without the BOW loss (still in development; PRs welcome)
DSHRED: Context-Sensitive Generation of Open-Domain Conversational Responses, dynamic and static attention mechanisms on HRED
DSHRED-WA: DSHRED with word-level attention.
ReCoSa: ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation. Note that this implementation differs slightly from the original code, but it is more powerful and practical (3 layers of multi-head self-attention, versus only 1 layer in the original paper).
ReCoSa-WA: ReCoSa with word-level attention.
HRAN: Hierarchical Recurrent Attention Network for Response Generation; in practice it is the same as HRED with a word-level attention mechanism.

FAQ

multiturndialogzoo's Issues

When can models stop training?

I observed that after the Test PPL of the model stops falling, the other test metrics keep rising. For example, Valid Loss and Test PPL may be lowest at epoch 22, but after epoch 22 the BLEU and Distinct results continue to rise.
According to the experimental procedure, the model parameters should in theory be saved at the lowest point of the loss on the Valid dataset, and the model's evaluation results on the Test dataset at that point can then be regarded as its final performance. Is that right?
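
The selection rule described in this question (keep the checkpoint with the lowest validation loss and report the test metrics of that checkpoint only) can be sketched as below. The record layout, the function name `select_checkpoint`, and the numbers are made up for illustration; they are not results from this repository.

```python
from typing import Dict, List


def select_checkpoint(history: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the epoch record with the lowest validation loss."""
    if not history:
        raise ValueError("history is empty")
    return min(history, key=lambda record: record["valid_loss"])


if __name__ == "__main__":
    # Made-up numbers: test BLEU keeps rising after the validation loss bottoms out.
    history = [
        {"epoch": 21, "valid_loss": 3.42, "test_bleu4": 0.030},
        {"epoch": 22, "valid_loss": 3.40, "test_bleu4": 0.031},
        {"epoch": 23, "valid_loss": 3.45, "test_bleu4": 0.033},
    ]
    best = select_checkpoint(history)
    print("report test metrics from epoch", best["epoch"])
```

Choosing the epoch by a test metric instead would leak test information into model selection, which is why only validation quantities appear in the selection key.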

There seems to be no preprocessing script for the Ubuntu corpus in the data/data_process folder?

The preprocessing script for the Ubuntu corpus cannot be found in the data/data_process folder.

The performances when using ReCoSa

Hi, Thank you for publishing this great repository.
I opened this issue because I want to ask a question to you.

I'm currently trying to implement a multi-turn dialogue generation task using ReCoSa structure.
I coded the entire structure by referring the paper and I combined 4 datasets, DailyDialog, PersonaChat, EmpatheticDialogue and BlendedSkillTalk.
But after training, I have not been able to get satisfactory results: at inference time the model produces meaningless outputs or severely repetitive words.
I've been trying to improve it by changing the hyperparameter settings several times and training again and again, but I still can't get good results.

I want to know why mine is not working...maybe due to wrong hyperparameter settings, problems with implementation itself, or lack of data...
The most likely problem to me at this moment is the sequence length since I've set it very long at 300, so I think there will be difficulties in encoding an utterance but I'm not sure.

So I wonder whether you got decent quality with the ReCoSa model: not the automatic scores, but actual engaging conversations (even if your structure seems slightly different from the original version in the paper).
Looking at your code, my model has larger dimensions and more complexity, but I don't think this is the cause, since it does not even overfit.

Please tell me what your results looked like. I would be really grateful.
Thank you.

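Using a Chinese dataset

Hello, thank you for your work; it has helped me a lot. I would like to apply these models to a Chinese dataset. Is it enough to follow the format of the English datasets and segment the text with a word segmentation tool such as jieba?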

The result of Seq2Seq

Thanks for your amazing work! I recently modified a seq2seq-attn model based on the PyTorch Chatbot Tutorial to train a baseline for DailyDialog, but I couldn't train it successfully no matter how I changed the parameters. I then used your code but still got a poor result, as follows. Can you help me?

Which one is the best one?

Hi! Thank you for your work on this repo.

So after all your testing, what is the best architecture in terms of quality of generation?

Performance on DailyDialog dataset

Hi,
I tried running the Seq2Seq and HRED models on the dailydialog dataset. Here are the results I got:

Model Seq2Seq Result
BLEU-1: 0.215
BLEU-2: 0.0986
BLEU-3: 0.057
BLEU-4: 0.0366
ROUGE: 0.0492
Distinct-1: 0.0268; Distinct-2: 0.131
Ref distinct-1: 0.0599; Ref distinct-2: 0.3644
BERTScore: 0.1414

Model HRED Result
BLEU-1: 0.2121
BLEU-2: 0.0961
BLEU-3: 0.0542
BLEU-4: 0.0331
ROUGE: 0.0502
Distinct-1: 0.0208; Distinct-2: 0.0992
Ref distinct-1: 0.0588; Ref distinct-2: 0.3619
BERTScore: 0.1436

These results seem to be much lower than the ones reported in the DailyDialog paper: https://www.aclweb.org/anthology/I17-1099.pdf
Do you have any idea why that is the case?
Thanks!

The dataset has only 3 files instead of 6.

I downloaded the data from your link, but it contains only 'dial.test', 'dial.train', and 'dial.valid'.

baseline models

Hi, I find this repo quite helpful for the dialog system research community. Are the baseline models provided in this repo ready to use, or are they still in development?
Thank you!

training epochs

Hi. First, thank you for your amazing work. Your code has inspired me a lot and made my reproduction process much more efficient.
I have a question about the number of training epochs. In the raw files, you set it to 100. Is that appropriate for all the implemented models?

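About the Ubuntu dataset: seq2seqNoAttention performs too well

I preprocessed Ubuntu the same way as DailyDialog and obtained 3,800,000 examples, with max_len set to the provided default (or to the average length; either way, seq2seqNoAttention > hred). These are the results I get when running your code: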
seq2seq:
BLEU-1:0.1361
BLEU-2:0.0493
BLEU-3:0.0221
BLEU-4:0.0108

hred:
BLEU-1:0.1307
BLEU-2:0.0486
BLEU-3:0.0214
BLEU-4:0.0108

To disable attention, I simply took input_outputs[-1].
Have you run into this issue yourself? If not, what do you think the cause is? Looking forward to your reply.

output = output[:, :, :self.hidden_size] + output[:, :, self.hidden_size:], why?

Hello, after reading your code I don't quite understand this line in the encoder. Could you explain what it means or what it does?
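
One likely reading, assuming the encoder is a bidirectional GRU (an assumption; the thread itself does not confirm it): PyTorch returns the forward and backward hidden states of a bidirectional RNN concatenated along the last dimension, so the output has size 2 * hidden_size there. The line in question adds the two halves element-wise, which brings the output back to hidden_size. The sketch below reproduces the same slicing and addition on plain nested lists, laid out as [seq_len][batch][2 * hidden_size], to avoid depending on PyTorch; `sum_directions` is a hypothetical name.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # [seq_len][batch][features]


def sum_directions(output: Tensor3D, hidden_size: int) -> Tensor3D:
    """Add the forward half and the backward half of every feature vector."""
    return [
        [
            # vec[:hidden_size] is the forward state, vec[hidden_size:] the backward state.
            [f + b for f, b in zip(vec[:hidden_size], vec[hidden_size:])]
            for vec in step
        ]
        for step in output
    ]


if __name__ == "__main__":
    # One time step, one batch element, hidden_size = 2.
    output = [[[1.0, 2.0, 10.0, 20.0]]]
    print(sum_directions(output, hidden_size=2))
```

Summing rather than concatenating keeps the encoder output the same width as the decoder hidden state, so no extra projection layer is needed before attention.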

An inquiry about ReCoSa model

Hi, thank you for sharing such a helpful repo.
I am a little confused about the ReCoSa model. In the original paper, ReCoSa adds a GRU encoder to the Transformer. But in the code you provide, the main architecture still seems to be RNN-based (the decoder is a GRU).

RuntimeError: The size of tensor a (31) must match the size of tensor b (30) at non-singleton dimension 0

Hello, great work!
I am running evaluation with your code, using the dailydialog dataset and the Transformer model,
but I get this error: "RuntimeError: The size of tensor a (31) must match the size of tensor b (30) at non-singleton dimension 0".
Do you know how to fix it?