dialog-eval

Paper Poster Code1 Code2 documentation blog
A lightweight repo for automatic evaluation of dialog models using 17 metrics.

Features

🔀   Choose which metrics you want to be computed
🚀    Evaluation can automatically run either on a response file or a directory containing multiple files
💾   Metrics are saved in a pre-defined easy to process format
⚠️   The program warns you if some files required to compute specific metrics are missing

Metrics

  • Response length: Number of words in the response.
  • Per-word entropy: Probabilities of words are calculated based on frequencies observed in the training data. Entropy at the bigram level is also computed.
  • Utterance entropy: The product of per-word entropy and the response length. Also computed at the bigram level.
  • KL divergence: Measures how well the word distribution of the model responses approximates the ground truth distribution. Also computed at the bigram level (with bigram distributions).
  • Embedding: Embedding average, extrema, and greedy are measured. average measures the cosine similarity between the averages of the word vectors of the response and target utterances. extrema constructs a representation by taking, for each dimension, the value with the greatest absolute value among the word vectors in the response and target utterances, and measures the cosine similarity between the two. greedy matches each response token to a target token (and vice versa) based on the cosine similarity between their embeddings and averages the total score across all words. (A short code sketch of some of these metrics follows this list.)
  • Coherence: Cosine similarity of input and response representations (constructed with the average word embedding method).
  • Distinct: Distinct-1 and distinct-2 measure the ratio of unique unigrams/bigrams to the total number of unigrams/bigrams in a set of responses.
  • BLEU: Measures n-gram overlap between the response and the target (n = 1, 2, 3, 4). The smoothing method can be chosen in the arguments.
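
To make the definitions above concrete, here is a minimal sketch of how per-word entropy, distinct-n, and the embedding extrema similarity can be computed. This is not the repository's implementation; the function names and the plain frequency-dict / embedding-dict inputs are illustrative.

import math

import numpy as np
from scipy.spatial import distance


def per_word_entropy(response, train_word_probs):
  """Average negative log2 probability of the response words, based on
  unigram probabilities estimated from the training data."""
  logprobs = [math.log(train_word_probs[w], 2)
              for w in response if w in train_word_probs]
  return -sum(logprobs) / max(len(logprobs), 1)


def distinct_n(responses, n):
  """Ratio of unique n-grams to total n-grams over a set of responses."""
  ngrams = [tuple(resp[i:i + n])
            for resp in responses for i in range(len(resp) - n + 1)]
  return len(set(ngrams)) / max(len(ngrams), 1)


def extrema_vector(words, embeddings):
  """For each dimension, keep the value with the largest absolute value
  among the word vectors of the utterance."""
  vectors = np.array([embeddings[w] for w in words if w in embeddings])
  maxima, minima = vectors.max(axis=0), vectors.min(axis=0)
  return np.where(np.abs(maxima) >= np.abs(minima), maxima, minima)


def embedding_extrema(response, target, embeddings):
  """Cosine similarity between the extrema vectors of response and target."""
  return 1 - distance.cosine(extrema_vector(response, embeddings),
                             extrema_vector(target, embeddings))

For example, distinct_n([['hi', 'there'], ['hi', 'you']], 1) returns 0.75: three unique unigrams out of four.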

Setup

Run this command to install required packages:

pip install -r requirements.txt

Usage

The main file can be called from anywhere, but when specifying paths to directories you should give them relative to the root of the repository.

python code/main.py -h

For the complete documentation visit the wiki.

Input format

You should provide as many of the required argument paths as possible. If some are missing, the program will still run, but it will skip the metrics that require those files (and print which metrics were skipped). If you have a training data file, the program can automatically generate a vocabulary and download fastText embeddings.

If you don't want to compute all the metrics, you can easily select which ones should be computed in the config file.

Saving format

A file will be saved to the directory containing the response file(s). The first row lists the names of the metrics; each subsequent row contains the metrics for one file, with the file name followed by the individual metric values, separated by spaces. Each metric value consists of three numbers separated by commas: the mean, the standard deviation, and the confidence interval. You can set the t-value of the confidence interval in the arguments; the default corresponds to 95% confidence.
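
As an illustration of this layout, the snippet below is a minimal sketch of a parser for such a file, assuming the header row lists only the metric names and each following row starts with the response file name (the path in the comment is hypothetical):

def load_metrics(path):
  """Parse a saved metrics file into {file_name: {metric: (mean, std, conf)}}."""
  with open(path) as f:
    header = f.readline().split()
    results = {}
    for line in f:
      name, *values = line.split()
      results[name] = {metric: tuple(float(v) for v in triple.split(','))
                       for metric, triple in zip(header, values)}
  return results

# e.g. load_metrics('data/dailydialog/test_responses_metrics.txt')  # hypothetical path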

Results & Examples

Interestingly, all 17 metrics improve until a certain point and then stagnate, with no overfitting occurring, when a Transformer model is trained on DailyDialog. Check the appendix of the paper for figures.

TRF is the Transformer model evaluated at the validation loss minimum and TRF-O is the Transformer model evaluated after 150 epochs of training, where the metrics start stagnating. RT means randomly selected responses from the training set and GT means ground truth responses.

Transformer trained on Cornell


TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. These results are measured on the test set at a checkpoint where the validation loss was minimal.

Transformer trained on Twitter


TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. These results are measured on the test set at a checkpoint where the validation loss was minimal.

Contributing

Check the issues for some additions where help is appreciated. Any contributions are welcome ❤️
Please try to follow the coding style used in the repo (flake8, 2-space indentation, 80-character lines, thorough commenting, etc.).

New metrics can be added by making a class for the metric, which handles its computation given the data; check the BLEU metrics class for an example. Normally the init function handles any data setup that is needed later, and update_metrics updates the metrics dict using the current example passed in its arguments. Inside the class you should define the self.metrics dict, which stores lists of metric values for a given test file. The names of these metrics (the keys of the dictionary) should also be added to self.metrics in the config file. Finally, you need to add an instance of your metric class to self.objects; here, at initialization, you can make use of paths to data files if your metric requires any setup. After this your metric should be computed and saved automatically.

You should also add some constraints to your metric: for example, if a file required for its computation is missing, the user should be notified (see how the existing metrics handle missing files).
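
As a rough, hypothetical illustration of this structure (the class, its constructor argument, and the metric name below are made up and would have to be adapted to the repo's actual config and classes):

class AverageWordLengthMetrics:
  '''Toy metric: average word length of each response.'''

  def __init__(self, config):
    # Any setup needed later (e.g. loading a vocab or embeddings) goes here.
    self.metrics = {'avg-word-length': []}

  def update_metrics(self, resp_words, gt_words, source_words):
    '''
    Params:
      :resp_words: Response word list.
      :gt_words: Ground truth word list.
      :source_words: Source word list.
    '''
    if resp_words:
      self.metrics['avg-word-length'].append(
        sum(len(w) for w in resp_words) / len(resp_words))

The 'avg-word-length' key would then also be added to self.metrics in the config, and an instance of AverageWordLengthMetrics to self.objects.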

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.
Please include a link to this repo if you use it in your work and consider citing the following paper:

@inproceedings{Csaky:2019,
    title = "Improving Neural Conversational Models with Entropy-Based Data Filtering",
    author = "Cs{\'a}ky, Rich{\'a}rd and Purgai, Patrik and Recski, G{\'a}bor",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1567",
    pages = "5650--5669",
}


dialog-eval's Issues

Make the framework more modular

Currently every metric is tightly integrated. The framework should be more modular, so that new metrics can be easily added using a class template and some config parameters.

Vocab options

There should be an option to build the vocabulary entirely from the vector vocabulary, without a text vocabulary, and more flexible options in general.

How to use CoherenceMetrics?

After reading through the code:

  def update_metrics(self, resp_words, gt_words, source_words):
    '''
    Params:
      :resp_words: Response word list.
      :gt_words: Ground truth word list.
      :source_words: Source word list.
    '''
    avg_source = self.avg_embedding(source_words)
    avg_resp = self.avg_embedding(resp_words)

    # Check for zero vectors and compute cosine similarity.
    if np.count_nonzero(avg_resp) and np.count_nonzero(avg_source):
      self.metrics['coherence'].append(
        1 - distance.cosine(avg_source, avg_resp))

One could import the update_metrics function from CoherenceMetrics. Although the function receives resp_words, gt_words, and source_words, it is not clear to me how to use it. Could anybody provide a real example of how to call and use CoherenceMetrics?
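
A minimal standalone sketch of the same computation, assuming embeddings is a dict mapping words to vectors (e.g. loaded from the downloaded fastText file) and sidestepping the class setup entirely; the repository's own avg_embedding may use SIF weighting rather than the plain average shown here:

import numpy as np
from scipy.spatial import distance


def avg_embedding(words, embeddings, dim=300):
  # Plain average of the word vectors that are present in the vocabulary.
  vectors = [embeddings[w] for w in words if w in embeddings]
  return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


def coherence(source_words, resp_words, embeddings):
  avg_source = avg_embedding(source_words, embeddings)
  avg_resp = avg_embedding(resp_words, embeddings)
  if np.count_nonzero(avg_source) and np.count_nonzero(avg_resp):
    return 1 - distance.cosine(avg_source, avg_resp)
  return None

# e.g. coherence('how are you ?'.split(), 'i am fine .'.split(), embeddings)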

Optimize code

Currently, when computing the metrics, some things run twice or are not needed at all. This should be optimized.

Provide more embedding options

Currently only fastText is downloaded automatically, and only the SIF average word embedding is used. We could support other word embeddings and sentence representations like BERT.
