
nelson-liu / paraphrase-id-tensorflow


Various models and code (Manhattan LSTM, Siamese LSTM + Matching Layer, BiMPM) for the paraphrase identification task, specifically with the Quora Question Pairs dataset.

License: MIT License

Makefile 0.99% Python 97.83% Shell 1.18%
nlp machine-learning deep-learning tensorflow paraphrase-identification

paraphrase-id-tensorflow's Issues

Add NER and POS features to models

NER features seem quite important. For example, how would a model distinguish "Does he live in New York?" from "Does he live in Newark?" Changing named entities can have drastic effects on the meaning of a sentence.

Make SwitchableDropoutWrapper more efficient

SwitchableDropoutWrapper currently has to run the LSTM cell twice: once with dropout applied to the inputs and once without (and then use tf.cond to output the right result).

It'd be great to make it more efficient; I tried fixing it but couldn't get variable sharing to work properly.
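The single-pass alternative is to fold the train/eval switch into the keep probability itself, so the wrapped computation only runs once. A minimal sketch of the idea in plain Python (the function name and list-based input are illustrative, not the repo's API):

```python
import random

def switchable_dropout(xs, keep_prob, is_train, seed=0):
    """Single-pass dropout: at eval time the keep probability is forced
    to 1.0, so there is never a second forward pass to select between."""
    effective_keep = keep_prob if is_train else 1.0
    if effective_keep >= 1.0:
        return list(xs)  # identity at evaluation time
    rng = random.Random(seed)
    # inverted dropout: scale kept units so the expected value is unchanged
    return [x / effective_keep if rng.random() < effective_keep else 0.0
            for x in xs]
```

In TensorFlow the analogous trick would be to compute the keep probability as a scalar via tf.cond on the is_train placeholder and call the wrapped cell a single time, rather than running the cell under both settings.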

Add "evaluation" mode to model

Right now, the model can "train" (training on the train data while periodically measuring validation accuracy/loss) and it can "predict" (given an unlabeled test set, make predictions). It would be great to have an "evaluation" mode: given a train/val dataset, make predictions on it and write a file (optionally sorted by log loss or similar) with the questions, the correct answer, and our answer; this would really help with error analysis.
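A sketch of what that error-analysis report could look like, assuming binary labels and a predicted duplicate probability per pair (the function name and tuple layout are hypothetical, not existing code):

```python
import math

def evaluation_report(examples):
    """examples: iterable of (question1, question2, gold_label, predicted_prob),
    where predicted_prob is the model's probability that the pair is a duplicate.
    Returns rows sorted by descending log loss, i.e. worst errors first."""
    eps = 1e-15  # clamp probabilities away from 0/1 so log() is defined
    rows = []
    for q1, q2, gold, prob in examples:
        p = min(max(prob, eps), 1 - eps)
        loss = -math.log(p) if gold == 1 else -math.log(1 - p)
        rows.append((loss, q1, q2, gold, prob))
    return sorted(rows, reverse=True)
```

Writing these rows to a TSV, worst-first, would make scanning the model's biggest mistakes straightforward.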

Define BaseModel API

We should set up a BaseModel API, and have the other models inherit from it; that way, we can consolidate the run_model scripts into one script.
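One possible shape for that API, sketched with abc (the method names here are illustrative, not a committed interface):

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Hypothetical base class: shared driver logic lives here, so a single
    run_model script can construct any subclass and call the same methods."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def build_graph(self):
        """Construct the model-specific computation graph."""

    @abstractmethod
    def train(self, train_data, val_data):
        """Train, periodically reporting validation loss/accuracy."""

    @abstractmethod
    def predict(self, test_data):
        """Return predictions for an unlabeled test set."""
```

With this in place, a single run script could select the concrete class from a command-line flag and drive every model identically.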

Several questions about the code

Hi, I appreciate you sharing this project! It is very thoughtful work and friendly to newcomers!
I have some questions from reading the code.

1. In dataset.py:
Instead of writing:

new_instances = [i for i in self.instances]
return self.__class__(new_instances[:max_instances])

Why don't you write:
return self.__class__(self.instances[:max_instances])
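For what it's worth, slicing a list already returns a new list object, so the intermediate comprehension is indeed redundant; both forms are equivalent shallow copies. A quick demonstration:

```python
instances = [{"id": 1}, {"id": 2}, {"id": 3}]

# The comprehension-then-slice and the direct slice produce equal,
# independent list objects; both are shallow copies.
via_comprehension = [i for i in instances][:2]
via_slice = instances[:2]

assert via_comprehension == via_slice
assert via_slice is not instances            # new list object
via_slice.append({"id": 99})
assert len(instances) == 3                   # original list untouched
```

Note "shallow": the copies contain the same element objects, which is fine here since the instances themselves aren't mutated.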

2. At dataset.py line 180:

        label_counts = [(label, len([x for x in group]))
                        for label, group
                        in itertools.groupby(labels, lambda x: x[0])]
        label_count_str = str(label_counts)
        if len(label_count_str) > 100:
            label_count_str = label_count_str[:100] + '...'
        logger.info("Finished reading dataset; label counts: %s",
                    label_count_str)

Why do you do the above processing?
Correct me if I am wrong (fake example):

>>> label_counts
[(0, 300), (1, 300), (2, 300)]
>>> label_count_str
'[(0, 300), (1, 300), (2, 300)]'

Here 300 is the number of times a specific label appears.
I don't understand what all these steps are for. How does this calculate label counts?
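That reading is right: each tuple pairs a label with its count. One subtlety is that itertools.groupby only merges *consecutive* equal keys, so the counts are only meaningful if the labels are sorted (or already grouped). A toy reproduction, assuming each label is itself a one-element sequence (which is why the key is lambda x: x[0]):

```python
import itertools

labels = [[0], [0], [1], [1], [1], [2]]  # toy labels, one per instance

# groupby only merges consecutive equal keys, so sort first (or rely on
# the labels already being grouped) to get one entry per distinct label.
sorted_labels = sorted(labels, key=lambda x: x[0])
label_counts = [(label, len(list(group)))
                for label, group in itertools.groupby(sorted_labels,
                                                      lambda x: x[0])]
```

The str() truncation afterwards just keeps the log line short when there are many distinct labels.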

3. In sts_instance.py line 108:

Instead of writing:
fields = list(csv.reader([line]))[0]

Why not write:
fields = [x for x in csv.reader([line])]

I ran a trial:

>>> field = list(csv.reader(['asdf','fddf','ddd']))
>>> field
[['asdf'], ['fddf'], ['ddd']]
>>> field[0]
['asdf']

It seems like list(csv.reader(['asdf','fddf','ddd']))[0] can't be used to get the fields of a line? Correct me if I am wrong.
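The difference is that csv.reader takes an iterable of *lines*: wrapping a single line in a one-element list parses that line into its fields (correctly handling quoted commas), whereas the trial above passed three separate one-field lines. For example:

```python
import csv

line = '42,"Does he live in New York?","Does he live in Newark?",0'

# csv.reader iterates over lines; [line] means "parse this one line",
# and [0] pulls out its list of fields.
fields = list(csv.reader([line]))[0]

# Passing three strings instead means three lines of one field each,
# which is why the trial produced [['asdf'], ['fddf'], ['ddd']].
```

A plain line.split(",") would break on the quoted question text above, which is exactly why the code goes through csv.reader.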

4. Why create a character-level dictionary/instance?
When you tokenize the string 'Today has a good weather' at the character level, the result would be ['T','o','d','a','y','h','a', ...].
Character-level tokens seem unable to encode contextual information?
Could you tell me some cases where character-level tokenization makes a difference?
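One common motivation is that character-level inputs give the model something to work with for out-of-vocabulary words, typos, and shared morphology, where a word-level vocabulary would only see an UNK token. A toy example with a made-up vocabulary:

```python
word_vocab = {"does", "he", "live", "in", "new", "york"}

def word_tokens(sentence):
    return sentence.lower().split()

def char_tokens(sentence):
    return [c for c in sentence.lower() if c != " "]

unseen = "Does he live in Newark"

# "newark" never appeared in training, so a word-level model maps it to
# UNK; every character of "newark" is still in the character vocabulary,
# so a character-level encoder can still produce a useful representation.
oov_words = [w for w in word_tokens(unseen) if w not in word_vocab]
```

This matters for Quora questions in particular, which are full of rare entity names and misspellings.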

Thank you!

Create unifying Python setup script

Right now, setup involves running a bunch of make commands and doing some things in between (e.g. unzipping the GloVe download). It'd be nice to have a unified setup script that does all of this for you and gets the paths right as well (since you can get the directory of the script with os.path.dirname(os.path.abspath(__file__)) and do everything relative to that path).

Related: #58 (comment)
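A minimal sketch of the path-resolution piece (project_path is a hypothetical helper, not existing code):

```python
import os

def project_path(script_file, *parts):
    """Resolve a path relative to the directory containing script_file,
    so setup steps work regardless of the current working directory."""
    root = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(root, *parts)
```

Inside the setup script this would be called as, e.g., project_path(__file__, "data", "external"), so downloads and unzipped files always land in the same place no matter where the script is invoked from.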

BiMPM is too slow

Thanks for your great work.
But when I run bimpm.py on a 1080 Ti, training is very slow. I want to know whether this is a common situation? I know this model is complicated.

Remove the overrides, otherwise it reports an error

Traceback (most recent call last):
  File "scripts/run_model/run_siamese_matching_bilstm.py", line 16, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 13, in <module>
    class STSInstance(TextInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 69, in STSInstance
    @overrides
  File "/home//f.local/lib/python3.6/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "words"

RuntimeError: Unrecognized line format

Hi, I am running the BiMPM model to predict and getting the following result:

Traceback (most recent call last):
  File "run_bimpm.py", line 267, in <module>
    main()
  File "run_bimpm.py", line 160, in main
    mode="word+character")
  File "../../duplicate_questions/data/data_manager.py", line 390, in get_test_data_from_file
    self.instance_type)
  File "../../duplicate_questions/data/dataset.py", line 145, in read_from_file
    return TextDataset.read_from_lines(lines, instance_class)
  File "../../duplicate_questions/data/dataset.py", line 177, in read_from_lines
    instances = [instance_class.read_from_line(line) for line in tqdm(lines)]
  File "../../duplicate_questions/data/dataset.py", line 177, in <listcomp>
    instances = [instance_class.read_from_line(line) for line in tqdm(lines)]
  File "../../duplicate_questions/data/instances/sts_instance.py", line 118, in read_from_line
    raise RuntimeError("Unrecognized line format: " + line)
RuntimeError: Unrecognized line format: "life in dublin?"""

For now, my temporary workaround is to delete the else branch, so it will skip unrecognized lines.
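A slightly safer version of that workaround is to make the skipping explicit, logging and counting the unparseable lines rather than silently dropping them. The helper below is a hypothetical sketch, not the repo's actual parser, and the expected field counts are assumptions:

```python
import csv
import logging

logger = logging.getLogger(__name__)

def read_instances(lines, expected_fields=(3, 4)):
    """Parse CSV lines, skipping (and counting) any line whose field
    count is unexpected, instead of raising RuntimeError mid-run."""
    instances, skipped = [], 0
    for line in lines:
        fields = list(csv.reader([line]))[0]
        if len(fields) in expected_fields:
            instances.append(fields)
        else:
            skipped += 1
            logger.warning("Skipping unrecognized line format: %s", line)
    return instances, skipped
```

Logging the skipped count at the end makes it obvious when a quoting problem is eating a large fraction of the test set rather than a handful of genuinely malformed lines.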

AssertionError: No super class method found for "__eq__"

Hello, I followed each step you introduced, but got this error. What should I do?

Traceback (most recent call last):
  File "scripts/run_model/run_bimpm.py", line 14, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 122, in <module>
    class IndexedSTSInstance(IndexedInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 366, in IndexedSTSInstance
    @overrides
  File "/Users/fubincom/Applications/anaconda2/envs/tensorflow1.0/lib/python2.7/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "__eq__"

AssertionError:

Hi,
python scripts/run_model/run_siamese.py train --share_encoder_weights --model_name=baseline_siamese --run_id=0

Traceback (most recent call last):
  File "scripts/run_model/run_siamese.py", line 14, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 122, in <module>
    class IndexedSTSInstance(IndexedInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 366, in IndexedSTSInstance
    @overrides
  File "/usr/lib/python2.7/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "__eq__"

Thank you!
molyswu

pretrained models

Hey,
I am having trouble training the model. Could you please share the pre-trained model? That would be really helpful.
Thanks.

Using word2vec skip-gram instead of GloVe

In most cases, skip-gram works better than GloVe, and using skip-gram vectors can improve performance. Also, there are many more open-source libraries for working with skip-gram.

Refactor out unnecessary processing in data pipeline

Right now, the data pipeline tokenizes the input into both words and characters, even if you only want words. This is fine for now since character tokenization isn't that expensive, but it's not ideal once we add NER/POS features, since running the taggers can be quite slow and we don't want to do it unless necessary.
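One way to refactor this is to make each tokenization mode a lazily computed, cached attribute, so per-mode work only runs on first access. A simplified sketch (this TextInstance is a stand-in, not the repo's actual class, and split() stands in for the real tokenizer):

```python
class TextInstance:
    """Sketch: tokenize lazily and cache, so character-level (or future
    NER/POS) processing only runs when a model actually asks for it."""

    def __init__(self, text):
        self.text = text
        self._words = None
        self._characters = None

    @property
    def words(self):
        if self._words is None:
            self._words = self.text.split()  # stand-in word tokenizer
        return self._words

    @property
    def characters(self):
        if self._characters is None:
            # the expensive per-mode work happens only on first access
            self._characters = list(self.text)
        return self._characters
```

The same pattern would extend naturally to .ner_tags / .pos_tags properties that only invoke a tagger when a model's configuration requests those features.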

Prediction stuck at 65%

I got this error when trying to run the code (I only changed the GloVe file to 840B.300d but kept the filename as 6B.300d).
Does anybody know how to fix this?

  File "scripts/run_model/run_bimpm.py", line 267, in <module>
    main()
  File "scripts/run_model/run_bimpm.py", line 183, in main
    config.pretrained_word_embeddings_file_path)
  File "scripts/run_model/../../duplicate_questions/data/embedding_manager.py", line 113, in get_embedding_matrix
    len(fields) - 1))
ValueError: Provided embedding_dim of 300, but file at pretrained_embeddings_file_path has embeddings of size 299
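The message means some line in the file parsed to 299 values instead of 300; the 840B GloVe vocabulary is known to contain tokens with internal spaces, which breaks a naive split-and-count. One defensive parse, sketched as a hypothetical helper, is to take the last embedding_dim fields as the vector and rejoin whatever precedes them as the word:

```python
def parse_embedding_line(line, embedding_dim=300):
    """Parse a GloVe-style line robustly: some vocabularies (e.g. the
    840B release) contain tokens with internal spaces, so take the last
    embedding_dim fields as the vector and rejoin the rest as the word."""
    fields = line.rstrip().split(" ")
    if len(fields) < embedding_dim + 1:
        return None  # genuinely malformed line; caller may skip or raise
    word = " ".join(fields[:-embedding_dim])
    vector = [float(x) for x in fields[-embedding_dim:]]
    return word, vector
```

With this style of parsing, swapping in the 840B vectors would not trip the dimension check on those multi-token entries.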

Refactor DataManager workflow

Right now, the workflow with the DataManager is to get an infinite generator of train batches. Instead, it'd be better to get a function that returns a batch generator over the dataset, and then train on that as many times (epochs) as needed.
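The proposed workflow could look something like this sketch (get_batch_generator is hypothetical, and a plain list stands in for the dataset):

```python
def get_batch_generator(dataset, batch_size):
    """Return a zero-argument function that yields one pass (epoch) of
    batches; call it once per epoch instead of consuming an infinite
    stream of batches."""
    def generator():
        for start in range(0, len(dataset), batch_size):
            yield dataset[start:start + batch_size]
    return generator
```

Training then becomes an explicit double loop: for each epoch, iterate over make_batches(); epoch boundaries fall out naturally instead of being tracked by counting steps against an infinite generator.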

Add docs for getting set up

There are getting to be a fair number of random scripts and related pieces; I should probably update the README with instructions on how to get set up.
