conversationai-models's Introduction

ConversationAI Models

This repository contains example code to train machine learning models for text classification as part of the Conversation AI project.

Outline of the codebase

  • experiments/ contains the ML training framework.
  • annotator-models/ contains a Dawid-Skene implementation for modelling rater quality to produce better annotations.
  • attention-tutorial/ contains an introductory IPython notebook for RNNs with attention, as presented in the Devoxx talk "TensorFlow, deep learning and modern RNN architectures, without a PhD" by Martin Gorner.
  • kaggle-classification/ contains early experiments with Keras and Estimator for training on the Jigsaw Toxicity Kaggle competition. It will be superseded by experiments/ shortly.
  • model_evaluation/ contains utilities for using a model deployed on Cloud ML Engine (CMLE), and some notebooks that illustrate typical evaluation metrics.

About this code

This repository contains example code to help experiment with models to improve conversations; it is not an official Google product.


conversationai-models's Issues

Add documentation for downloading Glove embeddings

Currently, the Keras model training pipeline assumes you have a GloVe embeddings file at /local_data/glove.6B/glove.6B.100d.txt. We don't appear to have any documentation about where to get those embeddings or where to put them.

We should:

  • add documentation to the README
  • add logic to our run_local shell scripts that automatically downloads them from Cloud Storage if they don't exist and puts them in the right place (see the sketch below) [optional, but would be super great]
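A minimal Python sketch of that download logic, assuming the embeddings live under the gs://kaggle-model-experiments/resources bucket referenced elsewhere in this repo (the shell scripts could do the same with gsutil):

# Sketch only: fetch the GloVe embeddings from Cloud Storage if they are not
# already present locally. The local path mirrors the one the Keras pipeline
# expects; the GCS path matches the resources bucket used by the tf_trainer
# scripts, but treat it as an assumption.
import os
import tensorflow as tf

GLOVE_LOCAL = 'local_data/glove.6B/glove.6B.100d.txt'
GLOVE_GCS = 'gs://kaggle-model-experiments/resources/glove.6B/glove.6B.100d.txt'

if not os.path.exists(GLOVE_LOCAL):
  tf.gfile.MakeDirs(os.path.dirname(GLOVE_LOCAL))
  # tf.gfile understands gs:// paths when TensorFlow's GCS support is installed.
  tf.gfile.Copy(GLOVE_GCS, GLOVE_LOCAL, overwrite=False)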

Include example data to run dawid-skene locally

In the spirit of open source, it would be nice to include some sample data that people can use to run the dawid-skene code out of the box. I think a subset of the Wikipedia data used for the Kaggle competition would be good.

Automatically detect the model signature

In the Model implementation, we do not guarantee that the model arguments are compatible with the deployed CMLE model (in particular with that model's signature). Errors only surface when collecting predictions after a batch prediction job.

It seems that this script has a potential solution (look for "(Optional) Inspect the model binaries with the SavedModel CLI ").

It would be a nice feature to help the user initialize a Model instance.
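As a rough illustration of what such a check could look like, here is a TF 1.x sketch that loads a SavedModel and prints its serving signatures before a Model instance is created around it (the export path is a placeholder):

# Hedged sketch: inspect the signatures of an exported SavedModel so the Model
# arguments can be validated up front instead of after a batch prediction job.
import tensorflow as tf

export_dir = '/path/to/saved_model'  # placeholder

with tf.Session(graph=tf.Graph()) as sess:
  meta_graph_def = tf.saved_model.loader.load(
      sess, [tf.saved_model.tag_constants.SERVING], export_dir)
  for name, signature in meta_graph_def.signature_def.items():
    print('signature:', name)
    print('  inputs: ', {k: v.name for k, v in signature.inputs.items()})
    print('  outputs:', {k: v.name for k, v in signature.outputs.items()})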

Error running tf_gru_attention on many_communities data

When running the tf_gru_attention model on the many communities data, the graph runs for a few hundred steps before failing with the following error

tensorflow.python.framework.errors_impl.InvalidArgumentError: 0-th value returned by pyfunc_0 is double, but expects int64
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_INT64], token="pyfunc_0", _device="/device:CPU:0"]]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?], [?], [?,?], [?]], output_types=[DT_FLOAT, DT_INT32, DT_INT64, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

The script used to run this model was:

GCS_RESOURCES="gs://kaggle-model-experiments/resources"

python -m tf_trainer.tf_gru_attention.run \
  --train_path="${GCS_RESOURCES}/transfer_learning_data/many_communities/20181105_train.tfrecord" \
  --validate_path="${GCS_RESOURCES}/transfer_learning_data/many_communities/20181105_validate.tfrecord" \
  --embeddings_path="${GCS_RESOURCES}/glove.6B/glove.6B.100d.txt" \
  --model_dir="tf_gru_attention_local_model_dir" \
  --labels="removed" \
  --label_dtypes="int"
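A hedged guess at the failure mode: the py_func in the input pipeline is declared with Tout=[DT_INT64] but returns a double for the "removed" label. A tiny illustration of the kind of fix, with hypothetical names (the real parsing function lives in tf_trainer's input code), is to cast explicitly inside the function:

import numpy as np
import tensorflow as tf

def _parse_label(raw_label):
  # Return int64 explicitly so the value matches the declared Tout dtype.
  return np.asarray(int(float(raw_label.decode('utf-8'))), dtype=np.int64)

raw_label_tensor = tf.constant(b'1.0')
label = tf.py_func(_parse_label, [raw_label_tensor], tf.int64)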

Tensorflow js conversion blocked by unsupported ops

The char model when converted

tensorflowjs_converter \
  --output_node_names='frac_neg/predictions/probabilities' \
  --input_format=tf_saved_model \
  experiments/tf_char_cnn_local_model_dir/100000/1545431873/ \
  experiments/tf_char_cnn_local_model_dir/100000/tfjs/

produces the following:

ValueError: Unsupported Ops in the model before optimization
DecodeRaw, ParseExample, StringSplit

Compare attention models

Compare the effectiveness of CNN with attention against LSTM with attention. Metrics for comparison could include:

  • Overall ROC-AUC score
  • Some qualitative measure of the returned attention weights
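For the quantitative half, a minimal sketch of the ROC-AUC comparison (the arrays are placeholders for each model's scores on the same held-out labels):

from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1]
cnn_attention_probs = [0.2, 0.8, 0.6, 0.3, 0.9]    # hypothetical model outputs
lstm_attention_probs = [0.1, 0.7, 0.4, 0.4, 0.8]   # hypothetical model outputs

print('CNN + attention  ROC-AUC:', roc_auc_score(y_true, cnn_attention_probs))
print('LSTM + attention ROC-AUC:', roc_auc_score(y_true, lstm_attention_probs))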

Create a token/embedding creation preprocessing pipeline using tf-transform

Issue:
We currently depend on vocabularies, like glove embeddings, that are:

  1. weirdly biased (although when you backprop to the embeddings, their initial bias matters much less),
  2. dependent on being consistent with the tokenizer we use, and
  3. not guaranteed to cover the words that occur in our actual text.

Proposed solution/project:
Use https://github.com/tensorflow/transform to develop text preprocessing pipelines, e.g. to select tokens that occur sufficiently frequently, and create either random or smarter word embeddings for them.
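A hedged sketch of what such a preprocessing_fn might look like with recent tf.Transform APIs; the feature names and frequency threshold are assumptions, not the project's actual schema:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  tokens = tf.string_split(inputs['comment_text'])
  # Build the vocabulary from the training corpus itself, keeping only tokens
  # that occur sufficiently often; rare tokens go to an out-of-vocabulary bucket.
  token_ids = tft.compute_and_apply_vocabulary(
      tokens, frequency_threshold=5, num_oov_buckets=1)
  return {'token_ids': token_ids, 'label': inputs['label']}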

Update experiments README for framework changes

Hi Flavien,

Given the fantastic changes you've made to the framework to ease training, deployment, and evaluation of models, would you have time to take a look at the README file in the experiments/ folder and update it to help others use the codebase?

Thanks!

tf_trainer.common.tfrecord_input_test is broken

This test currently fails due to:
Traceback (most recent call last):
File "/usr/local/google/home/sorenj/github/conversationai-models/experiments/tf_trainer/common/tfrecord_input_test.py", line 75, in test_TFRecordInput_rounded
round_labels=True)
TypeError: __init__() got an unexpected keyword argument 'feature_preprocessor_init'

but the problem is deeper than just a change in parameter name.

The test began failing with commit 2a08943.

Write out the results in the correct format for the Kaggle competition

The Kaggle competition requires submissions to be formatted like this:

id,toxic,severe_toxic,obscene,threat,insult,identity_hate
6044863,0.5,0.5,0.5,0.5,0.5,0.5
6102620,0.5,0.5,0.5,0.5,0.5,0.5
14563293,0.5,0.5,0.5,0.5,0.5,0.5
21086297,0.5,0.5,0.5,0.5,0.5,0.5

We're not actually competing in the competition, but it would be good to output our predictions in the same format so we can test our scoring scripts.
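A simple sketch of writing that format with pandas; the ids and probabilities below are placeholders:

import pandas as pd

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

ids = [6044863, 6102620]                             # comment ids from the test set
probs = [[0.5] * len(LABELS), [0.5] * len(LABELS)]   # per-label model scores

submission = pd.DataFrame(probs, columns=LABELS)
submission.insert(0, 'id', ids)
submission.to_csv('submission.csv', index=False)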

Try data cleaning heuristics

These are not elegant, but might help.

  1. Write some curse-word regexes that would convert, for example, 'f u c k' --> 'fuck' and 'f*ck' --> 'fuck' (see the sketch below)
  2. Identify URLs
  3. Use a fancier tokenizer
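A rough illustration of heuristics 1 and 2; the patterns are examples only and would need broadening and tuning before real use:

import re

def clean(text):
  # Collapse spaced-out letters like 'f u c k' back into a single token.
  text = re.sub(r'\bf\s*u\s*c\s*k\b', 'fuck', text, flags=re.IGNORECASE)
  # Map masked spellings like 'f*ck' to the unmasked form.
  text = re.sub(r'\bf[\*@#]+ck\b', 'fuck', text, flags=re.IGNORECASE)
  # Replace URLs with a placeholder token so they don't pollute the vocabulary.
  text = re.sub(r'https?://\S+', '<URL>', text)
  return text

print(clean('f u c k this http://example.com'))  # -> 'fuck this <URL>'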

Text input based models should optionally not parse examples

For simplicity, instead of taking a TF Example with a single text feature, we should just input the tensor directly. This means the tf hub and char models should use

tf.estimator.export.build_raw_serving_input_receiver_fn

in place of the

build_parsing_serving_input_receiver_fn

which would eliminate at least one of the blocking ops for tensorflowjs. #222
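A minimal sketch of the proposed change, assuming a single string feature named 'text' (the name is an assumption):

import tensorflow as tf

# Serve a raw string tensor instead of parsing tf.Example protos, so
# DecodeRaw/ParseExample/StringSplit never enter the serving graph.
serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
    {'text': tf.placeholder(dtype=tf.string, shape=[None], name='text')})

# `estimator` is assumed to be an already-configured tf.estimator.Estimator:
# estimator.export_savedmodel('export_dir', serving_input_fn)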

BUG: run.ml_engine.sh seems to fail

Recent attempts to train the cloud keras_gru_attention model via run.ml_engine.sh have been failing with the following error:

Non-OK-status: status_ status: Failed precondition: could not dlopen DSO: libcupti.so.9.0; dlerror: libcupti.so.9.0: cannot open shared object file: No such file or directory

It is unclear when this bug began, but a keras_gru_attention model was successfully trained on July 12, 2018.

Write out evaluation stats to model directory

It would be nice if each directory that contains a SavedModel from one training run also included a JSON (?) file with the accuracy, AUC, FPR, FNR etc. for the model on the held-out data.
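A minimal sketch, assuming `estimator` and `eval_input_fn` already exist: evaluate on the held-out data and drop the metrics next to the SavedModel as JSON.

import json
import os

metrics = estimator.evaluate(input_fn=eval_input_fn)
metrics = {k: float(v) for k, v in metrics.items()}  # numpy scalars -> plain floats

with open(os.path.join(estimator.model_dir, 'eval_metrics.json'), 'w') as f:
  json.dump(metrics, f, indent=2)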

Converter to tf.record improvements

  • The current CSV to tf.record converter has hard-wired field selectors. These should be specified on the command line (see the sketch after this list).
  • CSV is a bad format: there are multiple, usually incompatible and badly supported, 'standards'. If people use CSV, print a warning that CSVs are fragile and that they should consider a more robust format like jsonlines or json.
  • Support jsonlines and json input, and when we do, maybe rename the script appropriately.
  • Consider: make our dataset class natively support and convert examples inline instead of requiring pre-processing.
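A hedged sketch of the first item: field selectors as command-line flags. The flag names and CSV layout are assumptions, not the current converter's interface.

import argparse
import csv
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--csv_path', required=True)
parser.add_argument('--tfrecord_path', required=True)
parser.add_argument('--text_field', default='comment_text')
parser.add_argument('--label_field', default='toxicity')
args = parser.parse_args()

with open(args.csv_path) as csv_file, \
     tf.python_io.TFRecordWriter(args.tfrecord_path) as writer:
  for row in csv.DictReader(csv_file):
    example = tf.train.Example(features=tf.train.Features(feature={
        args.text_field: tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[row[args.text_field].encode()])),
        args.label_field: tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row[args.label_field])])),
    }))
    writer.write(example.SerializeToString())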

Support exporting models

Estimator uses the most recent checkpoint by default; see https://www.tensorflow.org/get_started/checkpoints. Note that while checkpoints store only model weights, the whole graph plus weights (i.e. the model) can be exported and restored - this looks like the right abstraction, and may obviate the need for build_parsing_serving_input_receiver_fn, which exports a model that takes a TF.Example proto as input.

Something like (Thanks to @dborkan for the pointers!):

feature_spec = {'sentence': tf.FixedLenFeature(dtype=tf.string, shape=1)}
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
# Note: `estimator` below is an instance of the TF Estimator class.
estimator.export_savedmodel(<destination_directory>, serving_input_fn)

This seems to fit naturally into the base_model.py abstraction. To be figured out: what's the right way to specify the appropriate checkpoint to use?
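One possible answer, sketched under the assumption that `estimator` and `serving_input_fn` are defined as above: export_savedmodel accepts a checkpoint_path argument, so a specific checkpoint (rather than the latest) can be exported.

best_checkpoint = 'tf_gru_attention_local_model_dir/model.ckpt-100000'  # placeholder
estimator.export_savedmodel('exported_models', serving_input_fn,
                            checkpoint_path=best_checkpoint)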

Add ability for evaluation script to load a Keras model

We currently have a script that can load a SavedModel object from an Estimator model and use it to evaluate new data. This involves loading a saved VocabularyProcessor to pre-process new data, loading the SavedModel and running the new data through the model.

We'd like to add similar functionality for a Keras model. This will mean:

  • saving the Keras tokenizer in the model directory during a training run
  • loading a saved Keras model
  • preprocessing new data using the tokenizer
  • running new data through the loaded Keras model
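A hedged sketch of those four steps using standalone Keras; the file names and maxlen are placeholders and must match whatever the training run used:

import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# 1. At training time, save the fitted tokenizer next to the model, e.g.:
#    with open('model_dir/tokenizer.pkl', 'wb') as f: pickle.dump(tokenizer, f)

# 2. Load the saved Keras model.
model = load_model('model_dir/model.h5')

# 3. Preprocess new data using the saved tokenizer.
with open('model_dir/tokenizer.pkl', 'rb') as f:
  tokenizer = pickle.load(f)
sequences = tokenizer.texts_to_sequences(['some new comment to score'])
padded = pad_sequences(sequences, maxlen=250)

# 4. Run the new data through the loaded model.
predictions = model.predict(padded)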

train_and_eval correct behavior

I was using TensorBoard and getting fairly erratic eval points when training my model (e.g. one at 500 steps, the next at 7k steps). Am I misinterpreting our train_with_eval function?

If not, given that tf's train_with_eval preserves where it left off (as of tensorflow/tensorflow#19062 (comment)), can we stick it in a loop so we can get more eval points? I wrote some code that seems to work in this branch: https://github.com/conversationai/conversationai-models/tree/train-eval-loop

But I may be missing something about how checkpoints interact with when evaluation happens.
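For reference, a minimal sketch of the loop idea, assuming `estimator`, `train_input_fn` and `eval_input_fn` already exist (step counts are illustrative):

TOTAL_STEPS = 100000
EVAL_EVERY = 1000

steps_done = 0
while steps_done < TOTAL_STEPS:
  # train() resumes from the latest checkpoint, so looping does not restart training.
  estimator.train(input_fn=train_input_fn, steps=EVAL_EVERY)
  steps_done += EVAL_EVERY
  metrics = estimator.evaluate(input_fn=eval_input_fn)
  print('step', steps_done, metrics)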

Add Kaggle Competition Winner's Models

Since we have access to the code from the winners of the Kaggle competition, let's try to add their models to this framework. This will also test our ability to build a framework that is robust enough to quickly incorporate models from external sources.

Add checkpoints to dawid-skene trainer

The Dawid-Skene training pipeline currently doesn't write out any checkpoints, so you need to wait until training has finished before checking the results. If something fails at the end, you lose all the results. Not ideal.
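An illustration only (the trainer's actual variable names differ): periodically persist the EM state so an interrupted run can be inspected or resumed.

import numpy as np

def save_checkpoint(path, iteration, error_rates, label_posteriors):
  # Store the rater error-rate estimates and label posteriors for this iteration.
  np.savez(path,
           iteration=iteration,
           error_rates=error_rates,
           label_posteriors=label_posteriors)

# Inside the EM loop, e.g. every 10 iterations:
# if iteration % 10 == 0:
#   save_checkpoint('checkpoints/ds_%05d.npz' % iteration,
#                   iteration, error_rates, label_posteriors)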

Compare Kaggle model results against Perspective API scores

As a way to further evaluate these models, it would be nice to have a flag that will score a subset of the test data using the Perspective API. I'm imagining outputting results that have

  • comment_id
  • comment_text
  • y_class (e.g. 'toxic', 'obscene' etc.)
  • y_gold (if available)
  • y_prob (e.g. .89, 0.03 etc.)
  • perspective_api_prob
  • |y_prob - perspective_api_prob|
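A hedged sketch of fetching perspective_api_prob for one comment via the public commentanalyzer endpoint (the request/response shape follows the Perspective API docs; API_KEY is a placeholder):

import requests

API_KEY = 'YOUR_API_KEY'
URL = ('https://commentanalyzer.googleapis.com/v1alpha1/'
       'comments:analyze?key=' + API_KEY)

def perspective_api_prob(comment_text):
  body = {
      'comment': {'text': comment_text},
      'requestedAttributes': {'TOXICITY': {}},
  }
  response = requests.post(URL, json=body).json()
  return response['attributeScores']['TOXICITY']['summaryScore']['value']

# |y_prob - perspective_api_prob| for one example:
# delta = abs(y_prob - perspective_api_prob(comment_text))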

Allow training on larger datasets

Determine what must change with the pipeline to allow training on datasets that don't fit in memory. Test this out with available Perspective datasets.
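One likely direction, sketched with a placeholder file pattern and feature spec: stream TFRecords with tf.data so nothing has to fit in memory.

import tensorflow as tf

def input_fn():
  # Placeholder pattern; any local or GCS glob works the same way.
  files = tf.gfile.Glob('gs://kaggle-model-experiments/resources/*.tfrecord')
  dataset = tf.data.TFRecordDataset(files)
  dataset = dataset.map(
      lambda record: tf.parse_single_example(
          record,
          {'comment_text': tf.FixedLenFeature([], tf.string),
           'label': tf.FixedLenFeature([], tf.float32)}))
  return dataset.shuffle(10000).batch(64).repeat()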

Create some simple integration tests

Examples:

  • A simple text-based word finder and an XOR-style problem, e.g. find "cat" or "dog" in a string, but not both (a data-generation sketch follows below)
  • Training and metrics work end to end.
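A sketch of generating the "cat xor dog" toy data for such a test; entirely synthetic and only meant to check that training and metrics run end to end (requires Python 3.6+ for random.choices):

import random

WORDS = ['the', 'a', 'ran', 'sat', 'house', 'tree']

def make_example():
  tokens = random.choices(WORDS, k=8)
  if random.random() < 0.5:
    tokens.append(random.choice(['cat', 'dog']))  # exactly one -> positive
  else:
    tokens.extend(['cat', 'dog'])                 # both -> negative (xor)
  random.shuffle(tokens)
  label = int(('cat' in tokens) != ('dog' in tokens))
  return ' '.join(tokens), label

dataset = [make_example() for _ in range(1000)]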

Write subset of predictions to HTML file

Reading CSV files is tough, but it's often useful to look through the test data and predictions beyond just the accuracy metrics. One solution is to write a sample of the predictions in an HTML format that we can add some basic styling to, so it's easy to read. That way we can go from a new model to analyzing results really quickly.
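A minimal sketch using pandas: dump a readable sample of predictions to an HTML file that can be opened in a browser; the column names are illustrative.

import pandas as pd

predictions = pd.DataFrame({
    'comment_text': ['example comment one', 'example comment two'],
    'y_gold': [1, 0],
    'y_prob': [0.87, 0.12],
})

sample = predictions.sample(n=2, random_state=0)
with open('prediction_sample.html', 'w') as f:
  f.write(sample.to_html(index=False))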
