nelson-liu / paraphrase-id-tensorflow

Various models and code (Manhattan LSTM, Siamese LSTM + Matching Layer, BiMPM) for the paraphrase identification task, specifically with the Quora Question Pairs dataset.

License: MIT License

Languages: Python 97.83%, Shell 1.18%, Makefile 0.99%
Topics: nlp, machine-learning, deep-learning, tensorflow, paraphrase-identification

paraphrase-id-tensorflow's Introduction


paraphrase-id-tensorflow

Various models and code for paraphrase identification implemented in TensorFlow (1.1.0).

I took great care to document the code and explain what I'm doing at various steps throughout the models; hopefully it'll serve as didactic example code for those looking to get started with TensorFlow!

So far, this repo has implemented:

- a Siamese LSTM ("Manhattan LSTM")
- a Siamese LSTM with a matching layer
- BiMPM (Bilateral Multi-Perspective Matching)

PRs to add more models, or to optimize or patch existing ones, are more than welcome! The bulk of the model code resides in duplicate_questions/models.

A lot of the data processing code is taken from or inspired by allenai/deep_qa; go check it out if you like how this project is structured!

Installation

This project was developed in and has been tested on Python 3.5 (it likely doesn't work with other versions of Python), and the package requirements are in requirements.txt.

To install the requirements:

pip install -r requirements.txt

Note that after installing the requirements, you have to download the necessary NLTK data by running (in your shell):

python -m nltk.downloader punkt

GPU Training and Inference

Note that the requirements.txt file specifies tensorflow as a dependency, which is the CPU-only version of TensorFlow. If you have a GPU, you should uninstall the CPU version and install the GPU version by running:

pip uninstall tensorflow
pip install tensorflow-gpu

Getting / Processing The Data

To begin, run the following to generate the auxiliary directories for storing data, trained models, and logs:

make aux_dirs

In addition, if you want to use pretrained GloVe vectors, run:

make glove

which will download pretrained GloVe vectors to data/external/. Extract the files in that same directory.

Quora Question Pairs

To use the Quora Question Pairs data, download the dataset from Kaggle (may require an account). Place the downloaded zip archives in data/raw/, and extract the files to that same directory.

Then, run:

make quora_data

to automatically clean and process the data with the scripts in scripts/data/quora.

Running models

To train a model, or to load a trained model and predict with it, run the scripts in scripts/run_model/ with python <script_path>. You can get additional documentation about the parameters each script takes by running python <script_path> -h.

Here's an example run command for the baseline Siamese BiLSTM:

python scripts/run_model/run_siamese.py train --share_encoder_weights --model_name=baseline_siamese --run_id=0

Here's an example run command for the Siamese BiLSTM with matching layer:

python scripts/run_model/run_siamese_matching_bilstm.py train --share_encoder_weights --model_name=siamese_matching --run_id=0

Here's an example run command for the BiMPM model:

python scripts/run_model/run_bimpm.py train --early_stopping_patience=5 --model_name=biMPM --run_id=0

Note that the defaults might not be ideal for your use, so feel free to turn the knobs however you like.


Contributing

Do you have ideas on how to improve this repo? Have a feature request, bug report, or patch? Feel free to open an issue or PR, as I'm happy to address issues and look at pull requests.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- Original immutable data (e.g. Quora Question Pairs).
│
├── logs               <- Logs from training or prediction, including TF model summaries.
│
├── models             <- Serialized models.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment
│
├── duplicate_questions<- Module with source code for models and data.
│   ├── data           <- Methods and classes for manipulating data.
│   │
│   ├── models         <- Methods and classes for training models.
│   │
│   └── util           <- Various helper methods and classes for use in models.
│
├── scripts            <- Scripts for generating the data
│   ├── data           <- Scripts to clean and split data
│   │
│   └── run_model      <- Scripts to train and predict with models.
│
└── tests              <- Directory with unit tests.


paraphrase-id-tensorflow's Issues

remove the overrides decorator, otherwise it reports an error

Traceback (most recent call last):
  File "scripts/run_model/run_siamese_matching_bilstm.py", line 16, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 13, in <module>
    class STSInstance(TextInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 69, in STSInstance
    @overrides
  File "/home//f.local/lib/python3.6/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "words"

pretrained models

Hey,
I am having trouble training the model. Could you please share a pre-trained model? That would be really helpful.
Thanks.

AssertionError:

Hi,
python scripts/run_model/run_siamese.py train --share_encoder_weights --model_name=baseline_siamese --run_id=0
Traceback (most recent call last):
  File "scripts/run_model/run_siamese.py", line 14, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 122, in <module>
    class IndexedSTSInstance(IndexedInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 366, in IndexedSTSInstance
    @overrides
  File "/usr/lib/python2.7/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "__eq__"

Thank you!
molyswu

AssertionError: No super class method found for "__eq__"

Hello, I followed each step you described, but I get this error. What should I do?

Traceback (most recent call last):
  File "scripts/run_model/run_bimpm.py", line 14, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 122, in <module>
    class IndexedSTSInstance(IndexedInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 366, in IndexedSTSInstance
    @overrides
  File "/Users/fubincom/Applications/anaconda2/envs/tensorflow1.0/lib/python2.7/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "__eq__"

Refactor DataManager workflow

Right now, the workflow with the DataManager is to get an infinite generator of train batches. Instead, it'd be better to get a function that returns a batch generator over the dataset, and then train on that as many times (epochs) as needed.
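
A minimal sketch of that refactored workflow (names here are illustrative, not the repo's actual API):

    def get_batch_generator(dataset, batch_size):
        # Return a zero-argument function; each call starts a fresh
        # pass over the dataset, yielding one batch at a time.
        def generator():
            for start in range(0, len(dataset), batch_size):
                yield dataset[start:start + batch_size]
        return generator

    make_batches = get_batch_generator(list(range(100)), batch_size=32)
    for epoch in range(3):            # train for as many epochs as needed
        for batch in make_batches():  # one full pass per epoch
            pass                      # train on the batch here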

Add NER and POS features to models

NER features seem quite important. For example, how would a model distinguish "Does he live in New York" from "Does he live in Newark"? Changing named entities can have drastic effects on the meaning of a sentence.

Create unifying Python setup script

Right now, setup involves running a bunch of make commands and doing some things in between at times (e.g. unzipping the GloVe download). It'd be nice to have a unified setup script that would do this all for you, and get the paths right as well (since you can get the directory of the script with os.path.dirname(os.path.abspath(__file__)) and do everything relative to that path).

Related: #58 (comment)
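
A minimal sketch of such a script, assuming the Makefile targets described in the README (the zip filename pattern is an assumption):

    import os
    import subprocess

    # Resolve every path relative to this script's directory, so the
    # script works no matter where it is invoked from.
    PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))

    def run(command):
        subprocess.check_call(command, shell=True, cwd=PROJECT_ROOT)

    run("make aux_dirs")
    run("make glove")
    run("cd data/external && unzip -o '*.zip'")  # unzip the GloVe download in place
    run("make quora_data")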

Add "evaluation" mode to model

Right now, the model can "train" (training on train data / periodically measure validation accuracy / loss) and it can "predict" (given an unlabeled test set, make predictions). It would be great to have an "evaluation" mode, i.e. given a train/val dataset, make predictions on it and write a file (optionally sorted by log-loss or something) with the questions, correct answer, and our answer; this would really help with error analysis.
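
A rough sketch of the output step of such a mode (the record format here is hypothetical):

    import csv
    import math

    def write_error_analysis(records, path):
        # records: (question1, question2, label, predicted_prob) tuples.
        def log_loss(label, prob):
            prob = min(max(prob, 1e-15), 1 - 1e-15)  # avoid log(0)
            return -math.log(prob if label == 1 else 1 - prob)

        # Sort the most badly-missed examples to the top.
        rows = sorted(records, key=lambda r: log_loss(r[2], r[3]), reverse=True)
        with open(path, "w", newline="") as out_file:
            writer = csv.writer(out_file)
            writer.writerow(["question1", "question2", "label", "predicted_prob"])
            writer.writerows(rows)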

Add docs for getting set up

There are getting to be a fair number of scripts and assorted setup steps; I should probably update the README with instructions on how to get set up.

Refactor out unnecessary processing in data pipeline

Right now, the data pipeline will tokenize the input into both words and characters, even if you only want words. This is fine for now since character tokenization isn't that expensive, but it's not ideal for when we want to use NER/POS features, since running the taggers can be quite slow and we don't want to do it unless necessary.
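
A sketch of the lazier approach, with illustrative names:

    def tokenize(text, modes=("words",)):
        # Only run the tokenizers (and, later, the taggers) that were
        # actually requested, instead of always computing everything.
        result = {}
        if "words" in modes:
            result["words"] = text.split()
        if "characters" in modes:
            result["characters"] = list(text)
        return result

    print(tokenize("Does he live in New York"))
    print(tokenize("Does he live in New York", modes=("words", "characters")))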

Define BaseModel API

We should set up a BaseModel API, and have the other models inherit from it; that way, we can consolidate the run_model scripts into one script.
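
A minimal sketch of what that API could look like (illustrative, not the repo's code):

    from abc import ABC, abstractmethod

    class BaseModel(ABC):
        # Shared entry points, so one run_model script can drive any model.
        @abstractmethod
        def build_graph(self):
            ...

        @abstractmethod
        def train(self, train_data, validation_data):
            ...

        @abstractmethod
        def predict(self, test_data):
            ...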

bimpm is too slow

Thanks for your great work!
But when I run bimpm.py on a 1080 Ti, the training speed is very low. I want to know whether this is a common situation? I know this model is complicated.

RuntimeError: Unrecognized line format

Hi, I am running the BiMPM model to predict and getting the following result:

 66%|███████████████████████████████████████████████████████████████▊ | 2345735/3563475 [09:22<04:51, 4172.73it/s]

Traceback (most recent call last):
  File "run_bimpm.py", line 267, in <module>
    main()
  File "run_bimpm.py", line 160, in main
    mode="word+character")
  File "../../duplicate_questions/data/data_manager.py", line 390, in get_test_data_from_file
    self.instance_type)
  File "../../duplicate_questions/data/dataset.py", line 145, in read_from_file
    return TextDataset.read_from_lines(lines, instance_class)
  File "../../duplicate_questions/data/dataset.py", line 177, in read_from_lines
    instances = [instance_class.read_from_line(line) for line in tqdm(lines)]
  File "../../duplicate_questions/data/dataset.py", line 177, in <listcomp>
    instances = [instance_class.read_from_line(line) for line in tqdm(lines)]
  File "../../duplicate_questions/data/instances/sts_instance.py", line 118, in read_from_line
    raise RuntimeError("Unrecognized line format: " + line)
RuntimeError: Unrecognized line format: "life in dublin?"""

For now, my temporary workaround is to delete the else branch, so it will skip unrecognized lines.
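
A gentler version of that workaround, sketched under the assumption that read_from_line raises RuntimeError on bad input, would skip and log malformed lines instead of deleting the check entirely:

    import logging

    logger = logging.getLogger(__name__)

    def read_instances(lines, instance_class):
        # Skip malformed lines with a warning rather than aborting the
        # whole prediction run on one bad row.
        instances = []
        for line in lines:
            try:
                instances.append(instance_class.read_from_line(line))
            except RuntimeError:
                logger.warning("Skipping unrecognized line: %s", line)
        return instances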

Prediction stuck at 65%

I met this error when trying to run the code (I only changed GloVe to 840B.300d but kept the filename as 6B.300d).
Does anybody know how to fix this?

File "scripts/run_model/run_bimpm.py", line 267, in
main()
File "scripts/run_model/run_bimpm.py", line 183, in main
config.pretrained_word_embeddings_file_path)
File "scripts/run_model/../../duplicate_questions/data/embedding_manager.py", line 113, in get_embedding_matrix
len(fields) - 1))
ValueError: Provided embedding_dim of 300, but file at pretrained_embeddings_file_path has embeddings of size 299
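
One plausible cause (an assumption, not verified against this file): glove.840B.300d contains a handful of entries whose token itself includes whitespace, which throws off a naive split-and-count of fields. A defensive parse treats the last embedding_dim fields of each line as the vector:

    def parse_glove_line(line, embedding_dim=300):
        # Take the trailing embedding_dim fields as the vector, and join
        # whatever precedes them back into the token, so multi-word
        # tokens don't shift the field count.
        fields = line.rstrip().split(" ")
        vector = [float(x) for x in fields[-embedding_dim:]]
        token = " ".join(fields[:-embedding_dim])
        return token, vector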

Several questions about the code

Hi, I appreciate your sharing this project! It is very thoughtful work and friendly to newcomers!
I have some questions when reading the code.

1. In dataset.py
Instead of writing:

new_instances = [i for i in self.instances]
return self.__class__(new_instances[:max_instances])

Why don't you write:
return self.__class__(self.instances[:max_instances])

2. At dataset.py line 180:

        label_counts = [(label, len([x for x in group]))
                        for label, group
                        in itertools.groupby(labels, lambda x: x[0])]
        label_count_str = str(label_counts)
        if len(label_count_str) > 100:
            label_count_str = label_count_str[:100] + '...'
        logger.info("Finished reading dataset; label counts: %s",
                    label_count_str)

Why do you do the above processing?
Correct me if I am wrong (fake example):

>>> label_counts
[(0, 300), (1, 300), (2, 300)]
>>> label_count_str
'[(0, 300), (1, 300), (2, 300)]'

300 is the number of times a specific type of label appears.
I don't understand what all these steps are for. How can this calculate label counts?
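
For what it's worth, itertools.groupby only merges consecutive equal keys, so this count is only correct if the labels arrive sorted; a quick runnable demonstration:

    import itertools

    labels = [("0",), ("0",), ("1",), ("1",), ("1",)]
    label_counts = [(label, len([x for x in group]))
                    for label, group
                    in itertools.groupby(labels, lambda x: x[0])]
    print(label_counts)  # [('0', 2), ('1', 3)]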

3. In sts_instance.py line 108:

Instead of writing:
fields = list(csv.reader([line]))[0]

Why not writing:
fields = [x for x in csv.reader([line])]

I ran a trial:

>>> field = list(csv.reader(['asdf','fddf','ddd']))
>>> field
[['asdf'], ['fddf'], ['ddd']]
>>> field[0]
['asdf']

It seems list(csv.reader(['asdf','fddf','ddd']))[0] can't be used to count the number of fields? Correct me if I am wrong.
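
A quick demonstration of the difference: csv.reader expects an iterable of lines, so passing a single CSV-formatted line inside a one-element list yields exactly one row of parsed fields:

    import csv

    line = 'a,b,"c, with a comma",d'
    fields = list(csv.reader([line]))[0]
    print(fields)       # ['a', 'b', 'c, with a comma', 'd']
    print(len(fields))  # 4 -- the number of fields in this one line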

4. Why create a character-level dictionary/instance?
When you tokenize the string 'Today has a good weather' to the character level, the result would be ['T','o','d','a','y','h','a',...]
Character-level tokens seem unable to encode contextual information?
Could you tell me some cases where character-level tokenization makes a difference?

Thank you!

Using word2vec skip-gram instead of glove

In most cases, skip-gram works better than GloVe, and using skip-gram vectors can improve performance. Also, there are many more open-source libraries for working with skip-gram.

Make SwitchableDropoutWrapper more efficient

SwitchableDropoutWrapper currently has to run the LSTM cell twice, once with the dropped-out inputs and once with the un-dropped-out inputs (and then use tf.cond to output the right one).

It'd be great to make it more efficient; I tried fixing it but couldn't get variable sharing to work properly.
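
One possible route, sketched here rather than taken from the repo: switch the keep probability itself with tf.cond and hand it to a standard DropoutWrapper (which accepts a scalar Tensor), so the wrapped cell only runs once:

    import tensorflow as tf

    is_train = tf.placeholder(tf.bool, name="is_train")

    # Select the keep probability instead of selecting between two
    # full runs of the cell.
    keep_prob = tf.cond(is_train,
                        lambda: tf.constant(0.8),
                        lambda: tf.constant(1.0))

    cell = tf.contrib.rnn.LSTMCell(256)
    cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)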
