
nelson-liu / paraphrase-id-tensorflow


Various models and code (Manhattan LSTM, Siamese LSTM + Matching Layer, BiMPM) for the paraphrase identification task, specifically with the Quora Question Pairs dataset.

License: MIT License

Makefile 0.99% Python 97.83% Shell 1.18%
nlp machine-learning deep-learning tensorflow paraphrase-identification

paraphrase-id-tensorflow's Issues

Add NER and POS features to models

NER features seem quite important. For example, how would a model distinguish "Does he live in New York?" from "Does he live in Newark?" Changing named entities can have drastic effects on the meaning of a sentence.

Make SwitchableDropoutWrapper more efficient

SwitchableDropoutWrapper currently has to run the LSTM cell twice: once with dropout applied to the inputs and once without (and then use tf.cond to output the right result).

It'd be great to make it more efficient; I tried fixing it but couldn't get variable sharing to work properly.
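The single-pass alternative is to fold the train/eval switch into the keep probability itself, so the wrapped computation only runs once. A minimal sketch of the idea in plain Python (the function name and list-based input are illustrative, not the repo's API):

```python
import random

def switchable_dropout(xs, keep_prob, is_train, seed=0):
    """Single-pass dropout: at eval time the keep probability is forced
    to 1.0, so there is never a second forward pass to select between."""
    effective_keep = keep_prob if is_train else 1.0
    if effective_keep >= 1.0:
        return list(xs)  # identity at evaluation time
    rng = random.Random(seed)
    # inverted dropout: scale kept units so the expected value is unchanged
    return [x / effective_keep if rng.random() < effective_keep else 0.0
            for x in xs]
```

In TensorFlow the analogous trick would be to compute the keep probability as a scalar via tf.cond on the is_train placeholder and call the wrapped cell a single time, rather than running the cell under both settings.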

Add "evaluation" mode to model

Right now, the model can "train" (training on the train data while periodically measuring validation accuracy/loss) and it can "predict" (given an unlabeled test set, make predictions). It would be great to have an "evaluation" mode: given a train/val dataset, make predictions on it and write a file (optionally sorted by log loss or similar) with the questions, the correct answer, and our answer; this would really help with error analysis.
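A sketch of what that error-analysis report could look like, assuming binary labels and a predicted duplicate probability per pair (the function name and tuple layout are hypothetical, not existing code):

```python
import math

def evaluation_report(examples):
    """examples: iterable of (question1, question2, gold_label, predicted_prob),
    where predicted_prob is the model's probability that the pair is a duplicate.
    Returns rows sorted by descending log loss, i.e. worst errors first."""
    eps = 1e-15  # clamp probabilities away from 0/1 so log() is defined
    rows = []
    for q1, q2, gold, prob in examples:
        p = min(max(prob, eps), 1 - eps)
        loss = -math.log(p) if gold == 1 else -math.log(1 - p)
        rows.append((loss, q1, q2, gold, prob))
    return sorted(rows, reverse=True)
```

Writing these rows to a TSV, worst-first, would make scanning the model's biggest mistakes straightforward.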

Define BaseModel API

We should set up a BaseModel API, and have the other models inherit from it; that way, we can consolidate the run_model scripts into one script.
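One possible shape for that API, sketched with abc (the method names here are illustrative, not a committed interface):

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Hypothetical base class: shared driver logic lives here, so a single
    run_model script can construct any subclass and call the same methods."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def build_graph(self):
        """Construct the model-specific computation graph."""

    @abstractmethod
    def train(self, train_data, val_data):
        """Train, periodically reporting validation loss/accuracy."""

    @abstractmethod
    def predict(self, test_data):
        """Return predictions for an unlabeled test set."""
```

With this in place, a single run script could select the concrete class from a command-line flag and drive every model identically.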

Several questions about the code

Hi, I appreciate you sharing this project! It is very thoughtful work and friendly to newcomers!
I have some questions from reading the code.

1. In dataset.py:
Instead of writing:

new_instances = [i for i in self.instances]
return self.__class__(new_instances[:max_instances])

Why don't you write:
return self.__class__(self.instances[:max_instances])
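For what it's worth, slicing a list already returns a new list object, so the intermediate comprehension is indeed redundant; both forms are equivalent shallow copies. A quick demonstration:

```python
instances = [{"id": 1}, {"id": 2}, {"id": 3}]

# The comprehension-then-slice and the direct slice produce equal,
# independent list objects; both are shallow copies.
via_comprehension = [i for i in instances][:2]
via_slice = instances[:2]

assert via_comprehension == via_slice
assert via_slice is not instances            # new list object
via_slice.append({"id": 99})
assert len(instances) == 3                   # original list untouched
```

Note "shallow": the copies contain the same element objects, which is fine here since the instances themselves aren't mutated.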

2. At dataset.py line 180:

        label_counts = [(label, len([x for x in group]))
                        for label, group
                        in itertools.groupby(labels, lambda x: x[0])]
        label_count_str = str(label_counts)
        if len(label_count_str) > 100:
            label_count_str = label_count_str[:100] + '...'
        logger.info("Finished reading dataset; label counts: %s",
                    label_count_str)

Why do you do the above processing?
Correct me if I am wrong (fake example):

>>> label_counts
[(0, 300), (1, 300), (2, 300)]
>>> label_count_str
'[(0, 300), (1, 300), (2, 300)]'

Here 300 is the number of times a specific label appears.
I don't understand what all these steps are for. How does this calculate label counts?
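That reading is right: each tuple pairs a label with its count. One subtlety is that itertools.groupby only merges *consecutive* equal keys, so the counts are only meaningful if the labels are sorted (or already grouped). A toy reproduction, assuming each label is itself a one-element sequence (which is why the key is lambda x: x[0]):

```python
import itertools

labels = [[0], [0], [1], [1], [1], [2]]  # toy labels, one per instance

# groupby only merges consecutive equal keys, so sort first (or rely on
# the labels already being grouped) to get one entry per distinct label.
sorted_labels = sorted(labels, key=lambda x: x[0])
label_counts = [(label, len(list(group)))
                for label, group in itertools.groupby(sorted_labels,
                                                      lambda x: x[0])]
```

The str() truncation afterwards just keeps the log line short when there are many distinct labels.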

3. In sts_instance.py line 108:

Instead of writing:
fields = list(csv.reader([line]))[0]

Why not write:
fields = [x for x in csv.reader([line])]

I ran a trial:

>>> field = list(csv.reader(['asdf','fddf','ddd']))
>>> field
[['asdf'], ['fddf'], ['ddd']]
>>> field[0]
['asdf']

It seems like list(csv.reader(['asdf','fddf','ddd']))[0] can't be used to get the fields of a line? Correct me if I am wrong.
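The difference is that csv.reader takes an iterable of *lines*: wrapping a single line in a one-element list parses that line into its fields (correctly handling quoted commas), whereas the trial above passed three separate one-field lines. For example:

```python
import csv

line = '42,"Does he live in New York?","Does he live in Newark?",0'

# csv.reader iterates over lines; [line] means "parse this one line",
# and [0] pulls out its list of fields.
fields = list(csv.reader([line]))[0]

# Passing three strings instead means three lines of one field each,
# which is why the trial produced [['asdf'], ['fddf'], ['ddd']].
```

A plain line.split(",") would break on the quoted question text above, which is exactly why the code goes through csv.reader.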

4. Why create a character-level dictionary/instance?
When you tokenize the string 'Today has a good weather' at the character level, the result would be ['T','o','d','a','y','h','a', ...].
Character-level tokens seem unable to encode contextual information?
Could you tell me some cases where character-level tokenization makes a difference?
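One common motivation is that character-level inputs give the model something to work with for out-of-vocabulary words, typos, and shared morphology, where a word-level vocabulary would only see an UNK token. A toy example with a made-up vocabulary:

```python
word_vocab = {"does", "he", "live", "in", "new", "york"}

def word_tokens(sentence):
    return sentence.lower().split()

def char_tokens(sentence):
    return [c for c in sentence.lower() if c != " "]

unseen = "Does he live in Newark"

# "newark" never appeared in training, so a word-level model maps it to
# UNK; every character of "newark" is still in the character vocabulary,
# so a character-level encoder can still produce a useful representation.
oov_words = [w for w in word_tokens(unseen) if w not in word_vocab]
```

This matters for Quora questions in particular, which are full of rare entity names and misspellings.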

Thank you!

Create unifying Python setup script

Right now, setup involves running a bunch of make commands and doing some things in between (e.g. unzipping the GloVe download). It'd be nice to have a unified setup script that does all of this for you and gets the paths right as well (since you can get the directory of the script with os.path.dirname(os.path.abspath(__file__)) and do everything relative to that path).

Related: #58 (comment)
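A minimal sketch of the path-resolution piece (project_path is a hypothetical helper, not existing code):

```python
import os

def project_path(script_file, *parts):
    """Resolve a path relative to the directory containing script_file,
    so setup steps work regardless of the current working directory."""
    root = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(root, *parts)
```

Inside the setup script this would be called as, e.g., project_path(__file__, "data", "external"), so downloads and unzipped files always land in the same place no matter where the script is invoked from.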

BiMPM is too slow

Thanks for your great work.
But when I run bimpm.py on a 1080 Ti, training is very slow. I want to know whether this is a common situation? I know this model is complicated.

Remove the overrides, otherwise it reports an error

Traceback (most recent call last):
  File "scripts/run_model/run_siamese_matching_bilstm.py", line 16, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 13, in <module>
    class STSInstance(TextInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 69, in STSInstance
    @overrides
  File "/home//f.local/lib/python3.6/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "words"

RuntimeError: Unrecognized line format

Hi, I am running the BiMPM model to predict and getting the following result:

Traceback (most recent call last):
  File "run_bimpm.py", line 267, in <module>
    main()
  File "run_bimpm.py", line 160, in main
    mode="word+character")
  File "../../duplicate_questions/data/data_manager.py", line 390, in get_test_data_from_file
    self.instance_type)
  File "../../duplicate_questions/data/dataset.py", line 145, in read_from_file
    return TextDataset.read_from_lines(lines, instance_class)
  File "../../duplicate_questions/data/dataset.py", line 177, in read_from_lines
    instances = [instance_class.read_from_line(line) for line in tqdm(lines)]
  File "../../duplicate_questions/data/dataset.py", line 177, in <listcomp>
    instances = [instance_class.read_from_line(line) for line in tqdm(lines)]
  File "../../duplicate_questions/data/instances/sts_instance.py", line 118, in read_from_line
    raise RuntimeError("Unrecognized line format: " + line)
RuntimeError: Unrecognized line format: "life in dublin?"""

For now, my temporary workaround is to delete the else branch, so it will skip unrecognized lines.
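A slightly safer version of that workaround is to make the skipping explicit, logging and counting the unparseable lines rather than silently dropping them. The helper below is a hypothetical sketch, not the repo's actual parser, and the expected field counts are assumptions:

```python
import csv
import logging

logger = logging.getLogger(__name__)

def read_instances(lines, expected_fields=(3, 4)):
    """Parse CSV lines, skipping (and counting) any line whose field
    count is unexpected, instead of raising RuntimeError mid-run."""
    instances, skipped = [], 0
    for line in lines:
        fields = list(csv.reader([line]))[0]
        if len(fields) in expected_fields:
            instances.append(fields)
        else:
            skipped += 1
            logger.warning("Skipping unrecognized line format: %s", line)
    return instances, skipped
```

Logging the skipped count at the end makes it obvious when a quoting problem is eating a large fraction of the test set rather than a handful of genuinely malformed lines.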

AssertionError: No super class method found for "__eq__"

Hello, I followed each step you introduced, but got this error. What should I do?

Traceback (most recent call last):
  File "scripts/run_model/run_bimpm.py", line 14, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 122, in <module>
    class IndexedSTSInstance(IndexedInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 366, in IndexedSTSInstance
    @overrides
  File "/Users/fubincom/Applications/anaconda2/envs/tensorflow1.0/lib/python2.7/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "__eq__"

AssertionError:

Hi,
python scripts/run_model/run_siamese.py train --share_encoder_weights --model_name=baseline_siamese --run_id=0

Traceback (most recent call last):
  File "scripts/run_model/run_siamese.py", line 14, in <module>
    from duplicate_questions.data.instances.sts_instance import STSInstance
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 122, in <module>
    class IndexedSTSInstance(IndexedInstance):
  File "scripts/run_model/../../duplicate_questions/data/instances/sts_instance.py", line 366, in IndexedSTSInstance
    @overrides
  File "/usr/lib/python2.7/site-packages/overrides/overrides.py", line 70, in overrides
    method.__name__)
AssertionError: No super class method found for "__eq__"

Thank you!
molyswu

pretrained models

Hey,
I am having trouble training the model. Could you please share the pre-trained model? That would be really helpful.
Thanks.

Using word2vec skip-gram instead of GloVe

In most cases, skip-gram works better than GloVe, and using skip-gram vectors can improve performance. Also, there are many more open-source libraries for working with skip-gram.

Refactor out unnecessary processing in data pipeline

Right now, the data pipeline tokenizes the input into both words and characters, even if you only want words. This is fine for now since character tokenization isn't that expensive, but it's not ideal once we add NER/POS features, since running the taggers can be quite slow and we don't want to do it unless necessary.
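One way to refactor this is to make each tokenization mode a lazily computed, cached attribute, so per-mode work only runs on first access. A simplified sketch (this TextInstance is a stand-in, not the repo's actual class, and split() stands in for the real tokenizer):

```python
class TextInstance:
    """Sketch: tokenize lazily and cache, so character-level (or future
    NER/POS) processing only runs when a model actually asks for it."""

    def __init__(self, text):
        self.text = text
        self._words = None
        self._characters = None

    @property
    def words(self):
        if self._words is None:
            self._words = self.text.split()  # stand-in word tokenizer
        return self._words

    @property
    def characters(self):
        if self._characters is None:
            # the expensive per-mode work happens only on first access
            self._characters = list(self.text)
        return self._characters
```

The same pattern would extend naturally to .ner_tags / .pos_tags properties that only invoke a tagger when a model's configuration requests those features.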

Prediction stuck at 65%

I got this error when trying to run the code (I only changed the GloVe file to 840B.300d but kept the filename as 6B.300d).
Does anybody know how to fix this?

  File "scripts/run_model/run_bimpm.py", line 267, in <module>
    main()
  File "scripts/run_model/run_bimpm.py", line 183, in main
    config.pretrained_word_embeddings_file_path)
  File "scripts/run_model/../../duplicate_questions/data/embedding_manager.py", line 113, in get_embedding_matrix
    len(fields) - 1))
ValueError: Provided embedding_dim of 300, but file at pretrained_embeddings_file_path has embeddings of size 299
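The message means some line in the file parsed to 299 values instead of 300; the 840B GloVe vocabulary is known to contain tokens with internal spaces, which breaks a naive split-and-count. One defensive parse, sketched as a hypothetical helper, is to take the last embedding_dim fields as the vector and rejoin whatever precedes them as the word:

```python
def parse_embedding_line(line, embedding_dim=300):
    """Parse a GloVe-style line robustly: some vocabularies (e.g. the
    840B release) contain tokens with internal spaces, so take the last
    embedding_dim fields as the vector and rejoin the rest as the word."""
    fields = line.rstrip().split(" ")
    if len(fields) < embedding_dim + 1:
        return None  # genuinely malformed line; caller may skip or raise
    word = " ".join(fields[:-embedding_dim])
    vector = [float(x) for x in fields[-embedding_dim:]]
    return word, vector
```

With this style of parsing, swapping in the 840B vectors would not trip the dimension check on those multi-token entries.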

Refactor DataManager workflow

Right now, the workflow with the DataManager is to get an infinite generator of train batches. Instead, it'd be better to get a function that returns a batch generator over the dataset, and then train on that as many times (epochs) as needed.
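The proposed workflow could look something like this sketch (get_batch_generator is hypothetical, and a plain list stands in for the dataset):

```python
def get_batch_generator(dataset, batch_size):
    """Return a zero-argument function that yields one pass (epoch) of
    batches; call it once per epoch instead of consuming an infinite
    stream of batches."""
    def generator():
        for start in range(0, len(dataset), batch_size):
            yield dataset[start:start + batch_size]
    return generator
```

Training then becomes an explicit double loop: for each epoch, iterate over make_batches(); epoch boundaries fall out naturally instead of being tracked by counting steps against an infinite generator.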

Add docs for getting set up

There are getting to be a fair number of random scripts and related pieces; I should probably update the README with instructions on how to get set up.
