conversationai-models's Introduction

ConversationAI Models

This repository contains example code to train machine learning models for text classification as part of the Conversation AI project.

Outline of the codebase

  • experiments/ contains the ML training framework.
  • annotator-models/ contains a Dawid-Skene implementation for modelling rater quality to produce better annotations.
  • attention-tutorial/ contains an introductory IPython notebook for RNNs with attention, as presented in the Devoxx talk "TensorFlow, deep learning and modern RNN architectures, without a PhD" by Martin Gorner.
  • kaggle-classification/ contains early experiments with Keras and Estimator for training on the Jigsaw Toxicity Kaggle competition. It will be superseded by experiments/ shortly.
  • model_evaluation/ contains utilities for using a model deployed on Cloud ML Engine (CMLE), and some notebooks that illustrate typical evaluation metrics.

About this code

This repository contains example code to help experiment with models to improve conversations; it is not an official Google product.


conversationai-models's Issues

Add documentation for downloading Glove embeddings

Currently, the Keras model training pipeline assumes you have a GloVe embeddings file at /local_data/glove.6B/glove.6B.100d.txt. We don't appear to have any documentation about where to get those embeddings or where to put them.

We should:

  • add documentation to the README
  • add logic to our run_local shell scripts that automatically downloads them from Cloud Storage if they don't exist and puts them in the right place (see the sketch below) [optional, but would be super great]
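A minimal Python sketch of that download logic, assuming the embeddings live under the gs://kaggle-model-experiments/resources bucket referenced elsewhere in this repo (the shell scripts could do the same with gsutil):

# Sketch only: fetch the GloVe embeddings from Cloud Storage if they are not
# already present locally. The local path mirrors the one the Keras pipeline
# expects; the GCS path matches the resources bucket used by the tf_trainer
# scripts, but treat it as an assumption.
import os
import tensorflow as tf

GLOVE_LOCAL = 'local_data/glove.6B/glove.6B.100d.txt'
GLOVE_GCS = 'gs://kaggle-model-experiments/resources/glove.6B/glove.6B.100d.txt'

if not os.path.exists(GLOVE_LOCAL):
  tf.gfile.MakeDirs(os.path.dirname(GLOVE_LOCAL))
  # tf.gfile understands gs:// paths when TensorFlow's GCS support is installed.
  tf.gfile.Copy(GLOVE_GCS, GLOVE_LOCAL, overwrite=False)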

Include example data to run dawid-skene locally

In the spirit of open source, it would be nice to include some sample data that people can use to run the dawid-skene code out of the box. I think a subset of the Wikipedia data used for the Kaggle competition would be good.

Automatically detect the model signature

In the Model implementation, we do not guarantee that the model arguments are compatible with the deployed CMLE model (in particular with that model's signature). Errors only surface when collecting predictions after a batch prediction job.

It seems that this script has a potential solution (look for "(Optional) Inspect the model binaries with the SavedModel CLI ").

It would be a nice feature to help the user initialize a Model instance.
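As a rough illustration of what such a check could look like, here is a TF 1.x sketch that loads a SavedModel and prints its serving signatures before a Model instance is created around it (the export path is a placeholder):

# Hedged sketch: inspect the signatures of an exported SavedModel so the Model
# arguments can be validated up front instead of after a batch prediction job.
import tensorflow as tf

export_dir = '/path/to/saved_model'  # placeholder

with tf.Session(graph=tf.Graph()) as sess:
  meta_graph_def = tf.saved_model.loader.load(
      sess, [tf.saved_model.tag_constants.SERVING], export_dir)
  for name, signature in meta_graph_def.signature_def.items():
    print('signature:', name)
    print('  inputs: ', {k: v.name for k, v in signature.inputs.items()})
    print('  outputs:', {k: v.name for k, v in signature.outputs.items()})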

Error running tf_gru_attention on many_communities data

When running the tf_gru_attention model on the many communities data, the graph runs for a few hundred steps before failing with the following error

tensorflow.python.framework.errors_impl.InvalidArgumentError: 0-th value returned by pyfunc_0 is double, but expects int64
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_INT64], token="pyfunc_0", _device="/device:CPU:0"]]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?], [?], [?,?], [?]], output_types=[DT_FLOAT, DT_INT32, DT_INT64, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

The script used to run this model was:

GCS_RESOURCES="gs://kaggle-model-experiments/resources"

python -m tf_trainer.tf_gru_attention.run \
  --train_path="${GCS_RESOURCES}/transfer_learning_data/many_communities/20181105_train.tfrecord" \
  --validate_path="${GCS_RESOURCES}/transfer_learning_data/many_communities/20181105_validate.tfrecord" \
  --embeddings_path="${GCS_RESOURCES}/glove.6B/glove.6B.100d.txt" \
  --model_dir="tf_gru_attention_local_model_dir" \
  --labels="removed" \
  --label_dtypes="int"
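A hedged guess at the failure mode: the py_func in the input pipeline is declared with Tout=[DT_INT64] but returns a double for the "removed" label. A tiny illustration of the kind of fix, with hypothetical names (the real parsing function lives in tf_trainer's input code), is to cast explicitly inside the function:

import numpy as np
import tensorflow as tf

def _parse_label(raw_label):
  # Return int64 explicitly so the value matches the declared Tout dtype.
  return np.asarray(int(float(raw_label.decode('utf-8'))), dtype=np.int64)

raw_label_tensor = tf.constant(b'1.0')
label = tf.py_func(_parse_label, [raw_label_tensor], tf.int64)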

Tensorflow js conversion blocked by unsupported ops

The char model when converted

tensorflowjs_converter \
  --output_node_names='frac_neg/predictions/probabilities' \
  --input_format=tf_saved_model \
  experiments/tf_char_cnn_local_model_dir/100000/1545431873/ \
  experiments/tf_char_cnn_local_model_dir/100000/tfjs/

produces the following:

ValueError: Unsupported Ops in the model before optimization
DecodeRaw, ParseExample, StringSplit

Compare attention models

Compare the effectiveness of CNN with attention against LSTM with attention. Metrics for comparison could include:

  • Overall ROC-AUC score
  • Some qualitative measure of the returned attention weights
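For the quantitative half, a minimal sketch of the ROC-AUC comparison (the arrays are placeholders for each model's scores on the same held-out labels):

from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1]
cnn_attention_probs = [0.2, 0.8, 0.6, 0.3, 0.9]    # hypothetical model outputs
lstm_attention_probs = [0.1, 0.7, 0.4, 0.4, 0.8]   # hypothetical model outputs

print('CNN + attention  ROC-AUC:', roc_auc_score(y_true, cnn_attention_probs))
print('LSTM + attention ROC-AUC:', roc_auc_score(y_true, lstm_attention_probs))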

Create a token/embedding creation preprocessing pipeline using tf-transform

Issue:
We currently depend on vocabularies, like glove embeddings, that are:

  1. weirdly biased (although when you backprop to the embeddings, their initial bias matters much less),
  2. dependent on being consistent with the tokenizer we use, and
  3. not guaranteed to cover the words that occur in our actual text.

Proposed solution/project:
Use https://github.com/tensorflow/transform to develop text preprocessing pipelines, e.g. to select tokens that occur sufficiently frequently, and create either random or smarter word embeddings for them.
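A hedged sketch of what such a preprocessing_fn might look like with recent tf.Transform APIs; the feature names and frequency threshold are assumptions, not the project's actual schema:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  tokens = tf.string_split(inputs['comment_text'])
  # Build the vocabulary from the training corpus itself, keeping only tokens
  # that occur sufficiently often; rare tokens go to an out-of-vocabulary bucket.
  token_ids = tft.compute_and_apply_vocabulary(
      tokens, frequency_threshold=5, num_oov_buckets=1)
  return {'token_ids': token_ids, 'label': inputs['label']}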

Update experiments README for framework changes

Hi Flavien,

Given the fantastic changes you've made to the framework to ease training, deployment, and evaluation of models, would you have time to take a look at the README file in the experiments/ folder and update it to help others use the codebase?

Thanks!

tf_trainer.common.tfrecord_input_test is broken

This test currently fails due to:
Traceback (most recent call last):
File "/usr/local/google/home/sorenj/github/conversationai-models/experiments/tf_trainer/common/tfrecord_input_test.py", line 75, in test_TFRecordInput_rounded
round_labels=True)
TypeError: __init__() got an unexpected keyword argument 'feature_preprocessor_init'

but the problem is deeper than just a change in parameter name.

The test began failing with commit 2a08943.

Write out the results in the correct format for the Kaggle competition

The Kaggle competition requires submissions to be formatted like this:

id,toxic,severe_toxic,obscene,threat,insult,identity_hate
6044863,0.5,0.5,0.5,0.5,0.5,0.5
6102620,0.5,0.5,0.5,0.5,0.5,0.5
14563293,0.5,0.5,0.5,0.5,0.5,0.5
21086297,0.5,0.5,0.5,0.5,0.5,0.5

We're not actually competing in the competition, but it would be good to output our predictions in the same format so we can test our scoring scripts.
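A simple sketch of writing that format with pandas; the ids and probabilities below are placeholders:

import pandas as pd

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

ids = [6044863, 6102620]                             # comment ids from the test set
probs = [[0.5] * len(LABELS), [0.5] * len(LABELS)]   # per-label model scores

submission = pd.DataFrame(probs, columns=LABELS)
submission.insert(0, 'id', ids)
submission.to_csv('submission.csv', index=False)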

Try data cleaning heuristics

These are not elegant, but might help.

  1. Write some curse-word regexes that would convert, for example, 'f u c k' --> 'fuck' and 'f*ck' --> 'fuck' (see the sketch below)
  2. Identify URLs
  3. Use a fancier tokenizer
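A rough illustration of heuristics 1 and 2; the patterns are examples only and would need broadening and tuning before real use:

import re

def clean(text):
  # Collapse spaced-out letters like 'f u c k' back into a single token.
  text = re.sub(r'\bf\s*u\s*c\s*k\b', 'fuck', text, flags=re.IGNORECASE)
  # Map masked spellings like 'f*ck' to the unmasked form.
  text = re.sub(r'\bf[\*@#]+ck\b', 'fuck', text, flags=re.IGNORECASE)
  # Replace URLs with a placeholder token so they don't pollute the vocabulary.
  text = re.sub(r'https?://\S+', '<URL>', text)
  return text

print(clean('f u c k this http://example.com'))  # -> 'fuck this <URL>'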

Text input based models should optionally not parse examples

For simplicity, instead of taking a TF Example with a single text feature, we should just input the tensor directly. This means the tf hub and char models should use

tf.estimator.export.build_raw_serving_input_receiver_fn

in place of the

build_parsing_serving_input_receiver_fn

which would eliminate at least one of the blocking ops for tensorflowjs. #222
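A minimal sketch of the proposed change, assuming a single string feature named 'text' (the name is an assumption):

import tensorflow as tf

# Serve a raw string tensor instead of parsing tf.Example protos, so
# DecodeRaw/ParseExample/StringSplit never enter the serving graph.
serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
    {'text': tf.placeholder(dtype=tf.string, shape=[None], name='text')})

# `estimator` is assumed to be an already-configured tf.estimator.Estimator:
# estimator.export_savedmodel('export_dir', serving_input_fn)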

BUG: run.ml_engine.sh seems to fail

Recent attempts to train the cloud keras_gru_attention model via run.ml_engine.sh have been failing with the following error:

Non-OK-status: status_ status: Failed precondition: could not dlopen DSO: libcupti.so.9.0; dlerror: libcupti.so.9.0: cannot open shared object file: No such file or directory

It is unclear when this bug began, but a keras_gru_attention model was successfully trained on July 12, 2018.

Write out evaluation stats to model directory

It would be nice if each directory that contains a SavedModel from one training run also included a JSON (?) file with the accuracy, AUC, FPR, FNR etc. for the model on the held-out data.
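A minimal sketch, assuming `estimator` and `eval_input_fn` already exist: evaluate on the held-out data and drop the metrics next to the SavedModel as JSON.

import json
import os

metrics = estimator.evaluate(input_fn=eval_input_fn)
metrics = {k: float(v) for k, v in metrics.items()}  # numpy scalars -> plain floats

with open(os.path.join(estimator.model_dir, 'eval_metrics.json'), 'w') as f:
  json.dump(metrics, f, indent=2)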

Converter to tf.record improvements

  • The current CSV to tf.record converter has hard-wired field selectors. These should be specified on the command line (see the sketch after this list).
  • CSV is a bad format: there are multiple, usually incompatible and badly supported, 'standards'. If people use CSV, print a warning that CSVs are fragile and that they should consider a more robust format like jsonlines or json.
  • Support jsonlines and json input, and when we do, maybe rename the script appropriately.
  • Consider: make our dataset class natively support and convert examples inline instead of requiring pre-processing.
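A hedged sketch of the first item: field selectors as command-line flags. The flag names and CSV layout are assumptions, not the current converter's interface.

import argparse
import csv
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--csv_path', required=True)
parser.add_argument('--tfrecord_path', required=True)
parser.add_argument('--text_field', default='comment_text')
parser.add_argument('--label_field', default='toxicity')
args = parser.parse_args()

with open(args.csv_path) as csv_file, \
     tf.python_io.TFRecordWriter(args.tfrecord_path) as writer:
  for row in csv.DictReader(csv_file):
    example = tf.train.Example(features=tf.train.Features(feature={
        args.text_field: tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[row[args.text_field].encode()])),
        args.label_field: tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row[args.label_field])])),
    }))
    writer.write(example.SerializeToString())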

Support exporting models

Estimator uses the most recent checkpoint by default; see https://www.tensorflow.org/get_started/checkpoints. Note that while checkpoints store only model weights, the whole graph plus weights (i.e. the model) can be exported and restored - this looks like the right abstraction, and may obviate the need for build_parsing_serving_input_receiver_fn, which exports a model that takes a TF.Example proto as input.

Something like (Thanks to @dborkan for the pointers!):

feature_spec = {'sentence': tf.FixedLenFeature(dtype=tf.string, shape=1)}
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
# Note: `estimator` below is an instance of the TF Estimator class.
estimator.export_savedmodel(<destination_directory>, serving_input_fn)

This seems to fit naturally into the base_model.py abstraction. To be figured out: what's the right way to specify the appropriate checkpoint to use?
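One possible answer, sketched under the assumption that `estimator` and `serving_input_fn` are defined as above: export_savedmodel accepts a checkpoint_path argument, so a specific checkpoint (rather than the latest) can be exported.

best_checkpoint = 'tf_gru_attention_local_model_dir/model.ckpt-100000'  # placeholder
estimator.export_savedmodel('exported_models', serving_input_fn,
                            checkpoint_path=best_checkpoint)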

Add ability for evaluation script to load a Keras model

We currently have a script that can load a SavedModel object from an Estimator model and use it to evaluate new data. This involves loading a saved VocabularyProcessor to pre-process new data, loading the SavedModel and running the new data through the model.

We'd like to add similar functionality for a Keras model. This will mean:

  • saving the Keras tokenizer in the model directory during a training run
  • loading a saved Keras model
  • preprocessing new data using the tokenizer
  • running new data through the loaded Keras model
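A hedged sketch of those four steps using standalone Keras; the file names and maxlen are placeholders and must match whatever the training run used:

import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# 1. At training time, save the fitted tokenizer next to the model, e.g.:
#    with open('model_dir/tokenizer.pkl', 'wb') as f: pickle.dump(tokenizer, f)

# 2. Load the saved Keras model.
model = load_model('model_dir/model.h5')

# 3. Preprocess new data using the saved tokenizer.
with open('model_dir/tokenizer.pkl', 'rb') as f:
  tokenizer = pickle.load(f)
sequences = tokenizer.texts_to_sequences(['some new comment to score'])
padded = pad_sequences(sequences, maxlen=250)

# 4. Run the new data through the loaded model.
predictions = model.predict(padded)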

train_and_eval correct behavior

I was using TensorBoard and getting fairly erratic eval points when training my model (e.g. one at 500 steps, the next at 7k steps). Am I misinterpreting our train_with_eval function?

If not, given that tf's train_with_eval preserves where it left off (as of tensorflow/tensorflow#19062 (comment)), can we stick it in a loop so we can get more eval points? I wrote some code that seems to work in this branch: https://github.com/conversationai/conversationai-models/tree/train-eval-loop

But I may be missing something about how checkpoints interact with when evaluation happens.
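For reference, a minimal sketch of the loop idea, assuming `estimator`, `train_input_fn` and `eval_input_fn` already exist (step counts are illustrative):

TOTAL_STEPS = 100000
EVAL_EVERY = 1000

steps_done = 0
while steps_done < TOTAL_STEPS:
  # train() resumes from the latest checkpoint, so looping does not restart training.
  estimator.train(input_fn=train_input_fn, steps=EVAL_EVERY)
  steps_done += EVAL_EVERY
  metrics = estimator.evaluate(input_fn=eval_input_fn)
  print('step', steps_done, metrics)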

Add Kaggle Competition Winner's Models

Since we have access to the code from the winners of the Kaggle competition, let's try to add their models to this framework. This will also test our ability to build a framework that is robust enough to quickly incorporate models from external sources.

Add checkpoints to dawid-skene trainer

The Dawid-Skene training pipeline currently doesn't write out any checkpoints, so you need to wait until training has finished before checking the results. If something fails at the end, you lose all the results. Not ideal.
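An illustration only (the trainer's actual variable names differ): periodically persist the EM state so an interrupted run can be inspected or resumed.

import numpy as np

def save_checkpoint(path, iteration, error_rates, label_posteriors):
  # Store the rater error-rate estimates and label posteriors for this iteration.
  np.savez(path,
           iteration=iteration,
           error_rates=error_rates,
           label_posteriors=label_posteriors)

# Inside the EM loop, e.g. every 10 iterations:
# if iteration % 10 == 0:
#   save_checkpoint('checkpoints/ds_%05d.npz' % iteration,
#                   iteration, error_rates, label_posteriors)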

Compare Kaggle model results against Perspective API scores

As a way to further evaluate these models, it would be nice to have a flag that will score a subset of the test data using the Perspective API. I'm imagining outputting results that have

  • comment_id
  • comment_text
  • y_class (e.g. 'toxic', 'obscene' etc.)
  • y_gold (if available)
  • y_prob (e.g. .89, 0.03 etc.)
  • perspective_api_prob
  • |y_prob - perspective_api_prob|
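A hedged sketch of fetching perspective_api_prob for one comment via the public commentanalyzer endpoint (the request/response shape follows the Perspective API docs; API_KEY is a placeholder):

import requests

API_KEY = 'YOUR_API_KEY'
URL = ('https://commentanalyzer.googleapis.com/v1alpha1/'
       'comments:analyze?key=' + API_KEY)

def perspective_api_prob(comment_text):
  body = {
      'comment': {'text': comment_text},
      'requestedAttributes': {'TOXICITY': {}},
  }
  response = requests.post(URL, json=body).json()
  return response['attributeScores']['TOXICITY']['summaryScore']['value']

# |y_prob - perspective_api_prob| for one example:
# delta = abs(y_prob - perspective_api_prob(comment_text))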

Allow training on larger datasets

Determine what must change with the pipeline to allow training on datasets that don't fit in memory. Test this out with available Perspective datasets.
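One likely direction, sketched with a placeholder file pattern and feature spec: stream TFRecords with tf.data so nothing has to fit in memory.

import tensorflow as tf

def input_fn():
  # Placeholder pattern; any local or GCS glob works the same way.
  files = tf.gfile.Glob('gs://kaggle-model-experiments/resources/*.tfrecord')
  dataset = tf.data.TFRecordDataset(files)
  dataset = dataset.map(
      lambda record: tf.parse_single_example(
          record,
          {'comment_text': tf.FixedLenFeature([], tf.string),
           'label': tf.FixedLenFeature([], tf.float32)}))
  return dataset.shuffle(10000).batch(64).repeat()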

Create some simple integration tests

Examples:

  • A simple text-based word finder and an XOR-style problem, e.g. find "cat" or "dog" in a string, but not both (a data-generation sketch follows below)
  • Training and metrics work end to end.
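A sketch of generating the "cat xor dog" toy data for such a test; entirely synthetic and only meant to check that training and metrics run end to end (requires Python 3.6+ for random.choices):

import random

WORDS = ['the', 'a', 'ran', 'sat', 'house', 'tree']

def make_example():
  tokens = random.choices(WORDS, k=8)
  if random.random() < 0.5:
    tokens.append(random.choice(['cat', 'dog']))  # exactly one -> positive
  else:
    tokens.extend(['cat', 'dog'])                 # both -> negative (xor)
  random.shuffle(tokens)
  label = int(('cat' in tokens) != ('dog' in tokens))
  return ' '.join(tokens), label

dataset = [make_example() for _ in range(1000)]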

Write subset of predictions to HTML file

Reading CSV files is tough, but it's often useful to look through the test data and predictions beyond just the accuracy metrics. One solution is to write a sample of the predictions in an HTML format that we can add some basic styling to, so it's easy to read. That way we can go from a new model to analyzing results really quickly.
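A minimal sketch using pandas: dump a readable sample of predictions to an HTML file that can be opened in a browser; the column names are illustrative.

import pandas as pd

predictions = pd.DataFrame({
    'comment_text': ['example comment one', 'example comment two'],
    'y_gold': [1, 0],
    'y_prob': [0.87, 0.12],
})

sample = predictions.sample(n=2, random_state=0)
with open('prediction_sample.html', 'w') as f:
  f.write(sample.to_html(index=False))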
