thu-coai / cotk

Conversational Toolkit. An Open-Source Toolkit for Fast Development and Fair Evaluation of Text Generation

License: Apache License 2.0

Python 95.58% Jupyter Notebook 4.42%

machine-learning natural-language-processing natural-language-generation deep-learning python data-processing text-data cotk metrics

cotk's Issues

[Feature] Report system

Write a script that pushes results to the dashboard.

Command:
'''
cotk-report [--result result.json] [--only-upload] [--entry main] [other parameter]
'''
result: the test results file.
only-upload: push the results without running the model.
entry: the entry point of the model.

In only-upload mode, the result should be comparable.
In full mode, the result can be reproduced.

Provide a list of APIs for the dashboard.
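
The command line above could be parsed along these lines. This is a minimal sketch using the standard argparse module, not the actual cotk-report implementation: the helper names build_parser and parse are hypothetical, and treating result.json and main as default values is an assumption read off the usage line.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the proposed cotk-report usage line."""
    parser = argparse.ArgumentParser(prog="cotk-report")
    # Assumed default, taken from the usage line above.
    parser.add_argument("--result", default="result.json",
                        help="file that holds the test results")
    parser.add_argument("--only-upload", action="store_true",
                        help="push results without running the model")
    # Assumed default, taken from the usage line above.
    parser.add_argument("--entry", default="main",
                        help="entry point of the model")
    return parser


def parse(argv):
    # parse_known_args keeps unrecognized arguments, so they can be
    # forwarded to the model as the "other parameter" part.
    args, other = build_parser().parse_known_args(argv)
    return args, other
```

For example, parse(["--only-upload", "--lr", "0.1"]) would set only_upload and leave ["--lr", "0.1"] to be passed on to the model.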

[Model] SeqGAN

Refer to SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

[Enhancement] update seq2seq-tensorflow

Bump to a newer TensorFlow version.
Fix all the warnings in the tests.

[Dataloader] Multiturn Dialog

Ubuntu Dialog Corpus

Refer to http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/

The hred model needs refactoring.

[Enhancement] enhance metric test

https://github.com/thu-coai/contk/blob/e6e3d641766e4ae2111f41e742be206cc8684d2c/tests/dataloader/test_metric.py#L61-L76

Make the sentence lengths differ across turns.
Make the number of turns differ across batches.

[Enhancement] fill unittest

TODO: Some small test

[Enhancement] change api in language_generation::get_batch

Change sentence to sent?

Change all sen to sent.

Needs rechecking.

Waiting on #107.

[Maintenance] update glove for better path

[Maintenance] Rename BaseLanguageGeneration?

BaseLanguageGeneration is easily confused with LanguageGeneration.
Change LanguageGeneration -> LanguageModeling?

[pytorch_modules] decoder

Build framework

[BUG] fix hred test

Describe the bug
The hred test is wrong.

Why is the number of turns in the generated sentences greater than the number of turns in the reference?

[Maintenance] Fix docs and add hints for dataloader

[Maintenance] Add test for file_utils and resource_processor

[Enhancement] unittest & code coverage

[Enhancement] assert metrics don't have side effects

A metric's forward function shouldn't change any of its inputs.

Add asserts for this to the unit tests.
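
One way to write such an assert is to snapshot the input before the call and compare afterwards. The sketch below is only an illustration: assert_no_side_effects and the two toy forward functions are hypothetical, and the data is plain Python lists (numpy arrays would need an array-aware comparison instead of ==).

```python
import copy


def assert_no_side_effects(forward, data):
    """Call forward(data) and check that data is left unchanged."""
    snapshot = copy.deepcopy(data)
    forward(data)
    assert data == snapshot, "forward() modified its input"


def well_behaved_forward(data):
    # Works on copies, so the caller's lists stay intact.
    return [sent[:] for sent in data["gen"]]


def buggy_forward(data):
    # Mutates the caller's list in place; the check above catches this.
    data["gen"][0].pop()
```

Passing well_behaved_forward to the helper succeeds, while buggy_forward makes it raise an AssertionError.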

[Feature] Rename fileutils to Downloader

Given a URL, return a local cached path.

Put it in cotk.downloader instead of _utils.
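
The "URL in, cached path out" behavior might look like the sketch below. It is an assumption-laden illustration, not the cotk.downloader API: cached_path, the cache_dir argument and the injectable fetch callable are hypothetical, and naming cache entries by the SHA-256 of the URL is just one possible scheme.

```python
import hashlib
import os
import urllib.request


def cached_path(url, cache_dir, fetch=urllib.request.urlretrieve):
    """Return a local path for url, downloading only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Derive the file name from the URL so the same URL always maps
    # to the same cache entry.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        fetch(url, path)
    return path
```

Keeping fetch as a parameter makes the function testable without network access.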

[Enhancement] Adapt test for metric using allvocabs

Description:

The dataloader now has new attributes: valid vocabs and invalid vocabs.
Valid vocabs are the vocabularies used by models.
All vocabs (== valid vocabs + invalid vocabs) are the vocabularies used by metrics.
If a word is not in all vocabs, it is an unknown vocab, which is ignored by metrics.

The metric unit tests must be adapted for the new metrics.

Requirements:

Pull the invalid_vocab branch.
FakeDataloader should have new attributes like all_vocab_size, ...
Bleu & Recorder metrics have to use all vocabs.
Perplexity uses a smoothing algorithm (see the code in PerplexityMetric for reference):
- If the model predicts a valid vocab, perplexity is calculated as before.
- If the model predicts UNK, the probability is divided evenly among the invalid vocabs.
- If the reference is UNK, the word is ignored.
So you have to write tests for the new PerplexityMetric and MultiturnPerplexityMetric.
Try to cover the 3 conditions above.
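
The three conditions can be sketched as follows. This is not the code in PerplexityMetric: the function name, the argument names and the id layout (valid vocabs first, then invalid vocabs, with unk_id inside the valid range and also used to mark unknown reference words) are assumptions made for illustration.

```python
import math


def smoothed_perplexity(log_probs, reference, unk_id, valid_size, all_size):
    """Perplexity of one sentence under the smoothing rules above.

    log_probs[t][w] is the model's log-probability of valid word w at
    step t; reference[t] is an id below all_size, or unk_id for a word
    outside all vocabs.
    """
    invalid_size = all_size - valid_size
    total, count = 0.0, 0
    for step, word in zip(log_probs, reference):
        if word == unk_id:
            continue  # condition 3: an unknown reference word is ignored
        if word < valid_size:
            total += step[word]  # condition 1: calculated as before
        else:
            # condition 2: the UNK probability is shared evenly
            # by the invalid vocabs
            total += step[unk_id] - math.log(invalid_size)
        count += 1
    if count == 0:
        raise ValueError("no reference word left to evaluate")
    return math.exp(-total / count)
```

A test that covers all three conditions can use a reference containing one valid word, one invalid word and one unknown word.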

[Model] RL for Dialogue

Refer to Deep Reinforcement Learning for Dialogue Generation.

[Enhancement] download module

Requirements:

Download data from the net (use a cache).
Import data from the local system (label it by hash value).
Config in JSON format.
Put it in ./contk/_utils/file_utils.py

For code, refer to
https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/file_utils.py

Pay attention to the Apache LICENSE.
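
Labeling a local file by its hash value could be done as in this sketch. The helper file_hash and the choice of SHA-256 are assumptions; the issue does not say which hash is intended.

```python
import hashlib


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat even for large dataset archives.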

[Models] CopyNet

Refer to Incorporating Copying Mechanism in Sequence-to-Sequence Learning.

[Maintenance] Refactor metric.py

Split it into multiple files.

[Enhancement] Make models docs

Requirements:

Turn the models' README.md into an rst file in docs.

[Model] HRED

Refer to Building end-to-end dialogue systems using generative hierarchical neural network models

[Model] CVAE

Refer to Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders

[BUG] bug in trim_index of dataloader

Describe the bug
trim_index raises an error when the expected token does not show up:

IndexError: list index out of range

Expected behavior
Don't trim when the token is not met.

Additional context
https://github.com/thu-coai/contk/blob/7e41e43d5eb4af4881bb5e61a338025ab9f77858/contk/dataloader/language_generation.py#L198-L220

and in other dataloaders.

Consider modifying the behavior of index_to_sen, because sometimes the expected token is absent and there is only pad.
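
A trim that tolerates the missing token might look like the sketch below. It is not the dataloader's code: the signature is hypothetical, the missing token is assumed to be an end-of-sentence marker, and stripping trailing padding in the fallback branch is an extra assumption based on the note about pad.

```python
def trim_index(index, end_id, pad_id):
    """Cut index at the first end_id; if there is none, do not trim on
    it and only strip trailing padding."""
    index = list(index)
    if end_id in index:
        # list.index is safe here because membership was checked first.
        return index[:index.index(end_id)]
    # No end token: leave the content alone, drop padding at the tail.
    while index and index[-1] == pad_id:
        index.pop()
    return index
```

Checking membership before calling list.index avoids the out-of-range lookup in the case where the token never appears.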

[Maintenance] Change contk to cotk

[Enhancement] metric add hints, docs & examples

Add hints for invalid input in metric.py

For example, missing start token or end token.

Reorganize docs for metric.py

[Maintenance] Add metric hash to hash recorder

A metric may hold some information for reference. To make sure the result is unique, put that information in the hash.

E.g., self-bleu needs the whole test set for uniqueness.
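
Folding reference data into a hash could be sketched like this. The helper data_hash is hypothetical and SHA-256 is an assumed choice; the point is only that the digest changes when the data or its order changes.

```python
import hashlib


def data_hash(sentences):
    """Hash a list of token-id lists; sensitive to content and order."""
    digest = hashlib.sha256()
    for sent in sentences:
        # repr() of a list keeps its brackets, so sentence boundaries
        # stay unambiguous after concatenation.
        digest.update(repr(list(sent)).encode("utf-8"))
    return digest.hexdigest()
```

Two runs that feed the same test set in the same order get the same digest, while a reordered or different test set gets another one.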

[BUG] MultiTurnPerplexityMetric

Describe the bug
https://github.com/thu-coai/contk/blob/e6e3d641766e4ae2111f41e742be206cc8684d2c/contk/metric/metric.py#L147-L150

Use multiturn as batch_size in sub_metric

Expected behavior
Add some comments to explain this.

[Enhancement] Refactor request

Refactor to eliminate duplication.

[Enhancement] gather download links of data

Gather the download links of the data and make a 'dataset_config.json' in ./contk/dataloader

{
"MSCOCO": "https://XXXX"
}

It is best to reference the original link; gzip or another compressed format can be used.

[Enhancement] Vocab List in Dataloader

For the implementation of #8 (CopyNet), the dataloader should change its behavior.

In our mind, there should be 3 vocab lists:

For model training, the smallest. It only includes words from the train set. Call it set $V.
For metrics, bigger. The model will be evaluated on this vocab list, which includes words from the train set and the test set. Call it set $M. Almost all models can't generate words from $M-$V, because they haven't seen them. However, CopyNet can generate words from $M-$V through its copy mechanism, so it's necessary to take these words into account when we implement metrics. For some models, $M-$V can be expressed as the UNK token; the dataloader has to translate it into a uniform distribution over $M-$V.
The whole space of words, including those not seen in any of the data. Call it set $N. We don't care about the words in $N-$M and ignore them when evaluating models, as in #37. $N-$M is the TRUE UNK.

Require:

Change the behavior of the dataloader and the metrics.
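
The relation between the three sets can be illustrated with plain Python sets. This is a sketch under assumptions: split_vocab and classify are hypothetical helpers, and the words that only a copy mechanism can produce are read as those in $M but not in $V.

```python
def split_vocab(train_words, test_words):
    """Return ($V, $M, words in $M but not in $V)."""
    model_vocab = set(train_words)                # $V: train set only
    metric_vocab = model_vocab | set(test_words)  # $M: train and test sets
    copy_only = metric_vocab - model_vocab        # reachable only by copying
    return model_vocab, metric_vocab, copy_only


def classify(word, model_vocab, metric_vocab):
    """Tell which vocab list a word falls into."""
    if word in model_vocab:
        return "model"
    if word in metric_vocab:
        return "metric-only"
    return "true-unk"  # in $N but not in $M: ignored by metrics
```

With train words a, b and test words b, c, the word c is metric-only and any other word is a true UNK.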

[Maintenance] Refactor dataloader of SwitchBoard

_build_vocab has to use multi_ref data.
renamed to inference metric. Embedding should have a default implementation (use word vectors from GloVe).
Add a unit test for the unique features of SwitchBoard.

Add a hash value.

[Enhancement] Make unit test for models

Requirement

Run the model tests only in CPU mode.
Just check the arguments and the connection with the main library.
There is no need to check performance.
Make the tests standalone, because they may need packages like tensorflow or pytorch.

[Feature] Use a stable link on github for data

Users may use the same id to download the same data from different sources:

"glove": default, from github
"glove~github": explicitly from github
"glove~tsinghua": explicitly from coai.tsinghua

[BUG] bleu will crash

Describe the bug
BleuMetric will crash when len(hypothesis) == 1?
Possibly because of smoothingFunction?

It's an upstream bug; just add a comment and give up.

To Reproduce

checked

[Enhancement] Metrics check whether models use the same data

Problems

It may be hard to evaluate 2 models using the same test data in the same way.
So it's important for the metrics to be able to tell which data is used.

Proposal A

Bind the metrics to the dataloader. Data must be processed in the same order.

Drawback: