
infersent's People

Contributors

aconneau, cdelahousse, facebook-github-bot, ollmer, ozancaglayan, talschuster, thammegowda, tiriar

infersent's Issues

Some details on training AllNLI

Hi,

I'm looking at the train_nli.py script for the actual training of the model. It seems to take only one NLI corpus (SNLI or MultiNLI) at a time, so how is the final AllNLI model produced? Is it first trained and tuned on SNLI until it reaches its best result, and then loaded and trained and tuned again on MultiNLI?

Dropout usage

This is probably a question about the paper rather than the code.

Was dropout found to be unhelpful for InferSent training and the subsequent SentEval tasks? The dropout parameters default to 0 in train_nli.py.

Different vectors for a sentence if it's in a batch

Here is my setup:

model = torch.load('encoder/infersent.allnli.pickle')
model.set_glove_path(<my_local_path_to_glove>)
model.build_vocab_k_words(K=100000)

s1 = "I went to the movies"
s2 = "I'm now at the beach"

vectors = model.encode([s1, s2])

Here's the printout from vectors:

array([[ 0.08746275,  0.07032087,  0.        , ...,  0.        ,
         0.        ,  0.02636627],
       [ 0.03963339,  0.09454132,  0.01724778, ..., -0.00346539,
        -0.03814263,  0.01555013]], dtype=float32)

s1_enc = model.encode([s1])

array([[ 0.08746275,  0.07032087, -0.02096005, ..., -0.03267486,
        -0.02566048,  0.02636627]], dtype=float32)

Now these look similar, but aren't exactly the same.

s2_enc = model.encode([s2])

array([[ 0.03963339,  0.09454132,  0.01724778, ..., -0.00346539,
        -0.03814263,  0.01555013]], dtype=float32)

Now these are exactly the same. I'm wondering why there's a disparity between encoding the sentences in a batch and encoding them separately.

Handling misspellings

Two quick questions:

How well does the encoder of InferSent handle misspellings in text you're trying to encode?
E.g. "A yello cat walked down the street"

What do you think will work better to get a single representation of 100 sentences (see the sketch after this list)?

  1. Taking the average embedding over the 100 sentences
  2. Concatenating the 100 sentences and then embedding the one, long sentence

Or will neither work very well?
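
For what it's worth, a minimal sketch of option 1, assuming `model` is a loaded InferSent encoder with its vocabulary already built:

import numpy as np

def average_embedding(model, sentences):
    # model.encode returns an (n, 4096) numpy array, one row per sentence
    embeddings = model.encode(sentences, tokenize=True)
    return np.mean(embeddings, axis=0)  # a single 4096-d vector for the set

Averaging at least keeps each sentence within the length regime the encoder saw during NLI training; a 100-sentence concatenation is far longer than anything in that training data.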

torch.load import error

I tried to follow the example code that was given, and it produces the following error on Mac OS X. I've tried many posts but found no clear answer yet; any help would be appreciated.

ImportError Traceback (most recent call last)
in ()
----> 1 infersent = torch.load('infersent.allnli.pickle')

.../lib/python2.7/site-packages/torch/serialization.pyc in load(f, map_location, pickle_module)
227 f = open(f, 'rb')
228 try:
--> 229 return _load(f, map_location, pickle_module)
230 finally:
231 if new_fd:

.../lib/python2.7/site-packages/torch/serialization.pyc in _load(f, map_location, pickle_module)
375 unpickler = pickle_module.Unpickler(f)
376 unpickler.persistent_load = persistent_load
--> 377 result = unpickler.load()
378
379 deserialized_storage_keys = pickle_module.load(f)

ImportError: No module named models
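
A hedged workaround, since setups differ: the pickle references the repo's models module, so Python must be able to import models.py when torch.load unpickles the object. Running from the encoder/ directory usually resolves this, or models.py can be put on sys.path first (the path below is a placeholder):

import sys
sys.path.append('/path/to/InferSent/encoder')  # the directory containing models.py

import torch
infersent = torch.load('infersent.allnli.pickle')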

Problem with ./get_data.bash command

When I run ./get_data.bash I get the following issue:

/usr/bin/env: ‘sed -f’: No such file or directory

Does anyone know how to solve this problem?
Thank you very much for the help!

get_data script malfunction

$ ./get_data.bash
http://nlp.stanford.edu/data/glove.840B.300d.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 315 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 2075M 100 2075M 0 0 383k 0 1:32:18 1:32:18 --:--:-- 545k
Archive: glove.840B.300d.zip
warning [glove.840B.300d.zip]: 76 extra bytes at beginning or within zipfile
(attempting to process anyway)
error [glove.840B.300d.zip]: reported length of central directory is
-76 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1
zipfile?). Compensating...
skipping: glove.840B.300d.txt need PK compat. v4.5 (can do v2.1)

note: didn't find end-of-central-dir signature at end of central dir.
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)

bash get_data.bash failed

mldl@mldlUB1604:~/ub16_prj/InferSent/dataset$ bash get_data.bash
http://nlp.stanford.edu/data/glove.840B.300d.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 315 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 2075M 100 2075M 0 0 1852k 0 0:19:07 0:19:07 --:--:-- 1159k
Archive: glove.840B.300d.zip
inflating: GloVe/glove.840B.300d.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 90.1M 100 90.1M 0 0 2442k 0 0:00:37 0:00:37 --:--:-- 5934k
Archive: SNLI/snli_1.0.zip
creating: SNLI/snli_1.0/
inflating: SNLI/snli_1.0/.DS_Store
creating: SNLI/__MACOSX/
creating: SNLI/__MACOSX/snli_1.0/
inflating: SNLI/__MACOSX/snli_1.0/._.DS_Store
extracting: SNLI/snli_1.0/Icon
inflating: SNLI/__MACOSX/snli_1.0/._Icon
inflating: SNLI/snli_1.0/README.txt
inflating: SNLI/__MACOSX/snli_1.0/._README.txt
inflating: SNLI/snli_1.0/snli_1.0_dev.jsonl
inflating: SNLI/snli_1.0/snli_1.0_dev.txt
inflating: SNLI/snli_1.0/snli_1.0_test.jsonl
inflating: SNLI/snli_1.0/snli_1.0_test.txt
inflating: SNLI/snli_1.0/snli_1.0_train.jsonl
inflating: SNLI/snli_1.0/snli_1.0_train.txt
inflating: SNLI/__MACOSX/._snli_1.0
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
curl: (60) server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.
unzip: cannot find or open MultiNLI/multinli_0.9.zip, MultiNLI/multinli_0.9.zip.zip or MultiNLI/multinli_0.9.zip.ZIP.
rm: cannot remove 'MultiNLI/multinli_0.9.zip': No such file or directory
rm: cannot remove 'MultiNLI/__MACOSX': No such file or directory
mv: cannot stat 'MultiNLI/multinli_0.9/multinli_0.9_train.txt': No such file or directory
mv: cannot stat 'MultiNLI/multinli_0.9/multinli_0.9_dev_matched.txt': No such file or directory
mv: cannot stat 'MultiNLI/multinli_0.9/multinli_0.9_dev_mismatched.txt': No such file or directory
rm: cannot remove 'MultiNLI/multinli_0.9': No such file or directory
awk: cannot open MultiNLI/train.multinli.txt (No such file or directory)
rm: cannot remove 'MultiNLI/train.multinli.txt': No such file or directory
awk: cannot open MultiNLI/dev.matched.multinli.txt (No such file or directory)
rm: cannot remove 'MultiNLI/dev.matched.multinli.txt': No such file or directory
awk: cannot open MultiNLI/dev.mismatched.multinli.txt (No such file or directory)
rm: cannot remove 'MultiNLI/dev.mismatched.multinli.txt': No such file or directory
mldl@mldlUB1604:~/ub16_prj/InferSent/dataset$

Determine semantic sentence similarity

Hi,
I'm interested in hearing whether InferSent is suitable for determining semantic similarity between sentences, for example by encoding each sentence and measuring the cosine distance between the resulting vectors.
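
A minimal sketch of that idea, assuming `model` is a loaded InferSent encoder; note that encode() already returns one vector per sentence, so no extra averaging is needed:

import numpy as np

def sentence_similarity(model, s1, s2):
    # encoding one sentence at a time also sidesteps the batch-padding
    # inconsistency reported in other issues on this page
    u = model.encode([s1], tokenize=True)[0]
    v = model.encode([s2], tokenize=True)[0]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))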

AssertionError: assert param_from.type() == param_to.type()

My code:

infersent = torch.load('encoder/infersent.allnli.pickle', map_location=lambda storage, loc: storage)
infersent.set_glove_path('/home/my_name/Documents/stackgan/InferSent/dataset/GloVe/glove.840B.300d.txt')

test_sentence = 'This foo is bar.'

infersent.build_vocab([test_sentence], tokenize=True)
encoded = infersent.encode([test_sentence], tokenize=True)
Traceback (most recent call last):
  File "test_model.py", line 17, in <module>
    encoded = infersent.encode([test_sentence], tokenize=True)
  File "/home/my_name/Documents/stackgan/InferSent/models.py", line 196, in encode
    batch = self.forward((batch, lengths[stidx:stidx + bsize]))
  File "/home/my_name/Documents/stackgan/InferSent/models.py", line 58, in forward
    sent_output = self.enc_lstm(sent_packed)[0] #seqlen x batch x 2*nhid
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 91, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 343, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 202, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 224, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 285, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 255, in forward
    _copyParams(weight, params)
  File "/home/my_name/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 183, in _copyParams
    assert param_from.type() == param_to.type()
AssertionError

This is Ubuntu with Python 3.6.1 and PyTorch 0.1.12_2 on a GPU, but I'm getting the same problem with Python 2.7. The code runs fine if I specify

infersent.use_cuda = False

(and I make the correction that @wasiahmad suggests in Issue 5.)

Linear projection into label space with bias

Hi,

I noticed in the code at https://github.com/facebookresearch/InferSent/blob/master/models.py that the last layer projects into label space (n_classes), and by default nn.Linear includes a bias.

Is this desirable/intentional? Wouldn't the bias here just capture the frequency distribution of the labels? I'm wondering if it's a good idea / standard practice to do this (see the sketch after the snippet below).

class ClassificationNet(nn.Module):
    def __init__(self, config):
        super(ClassificationNet, self).__init__()
        # ...
        self.classifier = nn.Sequential(
            nn.Linear(self.inputdim, 512),
            nn.Linear(512, self.n_classes),
        )
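
For comparison, a hedged sketch of the alternative the question implies: nn.Linear accepts a bias flag, so the final projection can drop the offset term explicitly (whether that actually helps is exactly the open question here; sizes follow the NLINet printout in another issue on this page):

import torch.nn as nn

inputdim, n_classes = 16384, 3
classifier = nn.Sequential(
    nn.Linear(inputdim, 512),
    nn.Linear(512, n_classes, bias=False),  # no per-label offset term
)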

It is not compatible with the new PyTorch

When I tried to run
embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
I got
AttributeError: 'LSTM' object has no attribute '_data_ptrs'

It seems the source code of the class torch.nn.modules.rnn.LSTM has changed.

Inconsistent output with pre-trained model

Using the pre-trained model infersent.allnli.pickle, we seem to get inconsistent results when placing the same sentence in different batches. As a concrete example, let the sentence be hello, world.; then

In [44]: infersent.encode(['hello, world.'], tokenize=True, verbose=True)
array([[ 0.14178185, -0.00886203, -0.03044512, ..., -0.03926747,
        -0.03814263, -0.00360173]], dtype=float32)

But

In [45]: encoder.infersent.encode(['hello, world.', 'a longer sentence is placed in the
    ...: batch'], tokenize=True, verbose=True)
array([[ 0.14178185,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.0756584 ,  0.08044495,  0.05892764, ...,  0.03711559,
         0.00114529,  0.02401241]], dtype=float32)

Looks like many dimensions of the embedding for hello, world. become zero. Checking the dropout ratio, I am sure dropout is not used (dpout_model = 0).
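
A hedged illustration of one likely cause (see the "Two pooling issues" report further down this page): if max-pooling runs over zero-padded time steps, a short sentence's vector can depend on how much padding its batch introduces. Masking the padding before pooling would make the result batch-independent; sent_output is assumed to be the (seqlen, batch, 2*nhid) hidden-state tensor from models.py:

import torch

def masked_max_pool(sent_output, lengths):
    # overwrite padded positions so they can never win the max
    for i, length in enumerate(lengths):
        sent_output[length:, i, :] = -1e9
    return torch.max(sent_output, 0)[0]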

UnicodeDecodeError while building vocabulary

Hello, I'm currently encountering a

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>

while trying to build the vocabulary using the GloVe 840B 300-dim word embeddings.

Environment
OS = Windows 10
Python = 3.5.4
PyTorch = 0.3.1 (via https://anaconda.org/peterjc123/pytorch)

The full traceback:

Traceback (most recent call last):
...
...
...
  File "<Path>", line 32, in setup
    model.build_vocab_k_words(K=100000)

  File "<Path>", line 142, in build_vocab_k_words
    self.word_vec = self.get_glove_k(K)

  File "<Path>", line 118, in get_glove_k
    for line in f:

  File "<Path>\envs\tensorflow\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>

I've managed to solve this issue by adding encoding='utf-8' in the following block of encoder/models.py:

def get_glove_k(self, K):
    assert hasattr(self, 'glove_path'), 'warning : you need \
                                         to set_glove_path(glove_path)'
    # create word_vec with k first glove vectors
    k = 0
    word_vec = {}
    with open(self.glove_path, encoding='utf-8') as f:
        for line in f:
            word, vec = line.split(' ', 1)
            if k <= K:
                word_vec[word] = np.fromstring(vec, sep=' ')
                k += 1
            if k > K:
                if word in ['<s>', '</s>']:
                    word_vec[word] = np.fromstring(vec, sep=' ')
            if k > K and all([w in word_vec for w in ['<s>', '</s>']]):
                break
    return word_vec

The *.txt file with GloVe vectors is encoded in UTF-8, as mentioned at the bottom of https://nlp.stanford.edu/projects/glove/ .

In case you could confirm this issue, I would like to provide a PR.

Best,
Niels

how to get entailment

Hello

Perhaps a newbie question: can you tell me how to check entailment between two sentences using InferSent?
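
For reference, a hedged sketch of one way to do it, assuming a full NLINet checkpoint saved by train_nli.py (the released infersent.allnli.pickle is only the encoder) together with the get_batch helper and word_vec dictionary from the training code; the label order is an assumption:

import torch
from torch.autograd import Variable

s1 = [['<s>', 'A', 'man', 'is', 'eating', '.', '</s>']]
s2 = [['<s>', 'Someone', 'is', 'eating', '.', '</s>']]
b1, l1 = get_batch(s1, word_vec)
b2, l2 = get_batch(s2, word_vec)
out = nli_net((Variable(b1), l1), (Variable(b2), l2))  # add .cuda() to b1/b2 on GPU
pred = torch.max(out, 1)[1].data.cpu().numpy()  # assumed: 0=entailment, 1=neutral, 2=contradiction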

Torch causing pain

RuntimeError: module compiled against API version 0xb but this version of numpy is 0x9
Traceback (most recent call last):
File "", line 1, in
File "/Library/Python/2.7/site-packages/torch/init.py", line 53, in
from torch._C import *
ImportError: numpy.core.multiarray failed to import

-- getting this to work on a mac is starting to look impossible

Interesting sentence similarity scores

import nltk
import torch
import numpy as np

sentence_model = torch.load('infersent.allnli.pickle')
GLOVE_PATH = '../dataset/GloVe/glove.840B.300d.txt'
print('loaded infersent')

sentence_model.set_glove_path(GLOVE_PATH)
sentence_model.build_vocab_k_words(K=100000)

def similarity(sentence_model, s1, s2):
    v1 = sentence_embed(sentence_model, s1)
    v2 = sentence_embed(sentence_model, s2)
    return cosine(v1, v2)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sentence_embed(sentence_model, sentence):
    return sentence_model.encode([sentence])[0]


similarity(sentence_model, "I do not like you", "I love you") # => 0.66

I'm wondering whether there is a bug in my code or this score is expected.

ValueError when encoding

I'm running the encoder/demo.ipynb notebook with Python 2.7 and PyTorch 0.1.12_1.
When running the line

embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)

I get the following error:

Nb words kept : 129333/130068 (99.43 %)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-c3081b78b915> in <module>()
----> 1 embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
      2 print('nb sentences encoded : {0}'.format(len(embeddings)))

/home/brian/InferSent/encoder/models.pyc in encode(self, sentences, bsize, tokenize, verbose)
    207                 (batch, lengths[stidx:stidx + bsize])).data.cpu().numpy()
    208             embeddings.append(batch)
--> 209         embeddings = np.vstack(embeddings)
    210 
    211         # unsort

/home/brian/anaconda3/envs/py2/lib/python2.7/site-packages/numpy/core/shape_base.pyc in vstack(tup)
    235 
    236     """
--> 237     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    238 
    239 def hstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Not sure if this is related, but loading the model produces a warning:

# make sure models.py is in the working directory
model = torch.load('infersent.allnli.pickle')
/home/brian/anaconda3/envs/py2/lib/python2.7/site-packages/torch/serialization.py:284: SourceChangeWarning: source code of class 'torch.nn.modules.rnn.LSTM' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)

How to resume training

Hi,
First, thank you for releasing the code for your encoder and for senteval:)

I tried to add an option for resuming training by adding the following code at line 122:
nli_net = torch.load(os.path.join(params.outputdir, params.outputmodelname))

However, the loss and accuracy still start from a random guess (about 33%), so I assume it didn't properly load the weights. What am I missing here? I'm pretty new to PyTorch, so sorry for the trivial question.

Thanks!
Gil
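
For what it's worth, a hedged sketch of one resume pattern (paths follow train_nli.py's defaults): copy the saved weights into the live nli_net with load_state_dict rather than rebinding the name, in case the script re-creates or re-initializes the network after line 122:

import os
import torch

saved = torch.load(os.path.join(params.outputdir, params.outputmodelname))
nli_net.load_state_dict(saved.state_dict())  # copy weights into the existing model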

implement rank words function

Can we create a separate method called rank_words to get the scores for the words in a sentence? It would be the same function as visualize, with a few lines changed.

Cannot load the trained model

I just tried this simple command and it raises an error. I do have the prerequisite libraries installed.

infersent = torch.load('infersent.allnli.pickle')

ImportError Traceback (most recent call last)
in ()
----> 1 infersent = torch.load('infersent.allnli.pickle')

/home/ymeng/anaconda2/lib/python2.7/site-packages/torch/serialization.pyc in load(f, map_location, pickle_module)
229 f = open(f, 'rb')
230 try:
--> 231 return _load(f, map_location, pickle_module)
232 finally:
233 if new_fd:

/home/ymeng/anaconda2/lib/python2.7/site-packages/torch/serialization.pyc in _load(f, map_location, pickle_module)
377 unpickler = pickle_module.Unpickler(f)
378 unpickler.persistent_load = persistent_load
--> 379 result = unpickler.load()
380
381 deserialized_storage_keys = pickle_module.load(f)

ImportError: No module named models

About training InferSent on another language

  • Could you show me what I have to do in order to train a new model on another language?
  • Also, is it possible to change the dimension of the output vectors (from 4096 down to 256, for example)? See the sketch after this list.

Thank you in advance.
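
A hedged sketch of the recipe: train from scratch with an NLI corpus in the target language formatted like dataset/SNLI/, plus word vectors for that language (the word-vector path is set inside train_nli.py in this version). The flag names below follow the Namespace printed in another issue on this page; dataset/MyNLI/ is a placeholder:

python train_nli.py --nlipath dataset/MyNLI/ --enc_lstm_dim 128

With the default bidirectional max-pooling encoder the sentence embedding has dimension 2 * enc_lstm_dim, so the released model's 2048 yields 4096-d vectors and 128 would yield 256-d vectors.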

Error following readme

Hi, when I try to follow the readme and encode sentences I get the following error.

In [5]: sentences
Out[5]:
['hello how are you?',
'I will make tea',
'I will not make tea',
'He will make tea']

In [7]: embeddings = infersent.encode(sentences)

IndexError Traceback (most recent call last)
in ()
----> 1 embeddings = infersent.encode(sentences)

InferSent/encoder/models.py in encode(self, sentences, bsize, tokenize, verbose)
211 # unsort
212 idx_unsort = np.argsort(idx_sort)
--> 213 embeddings = embeddings[idx_unsort]
214
215 if verbose:

IndexError: index 1 is out of bounds for axis 0 with size 1

Tensorflow Version

How can I convert the model to TensorFlow or Keras?
If I do so, is there a way to use the infersent.allnli.pickle pretrained model?
Also, I tried to use the pretrained model on a computer without a GPU and it did not work; is that my mistake, or is it supposed to run only on CUDA?

thanks,
great work!

Fixed Word Vectors

Just to clarify something: are the GloVe vectors fixed in this model? I.e., does gradient flow back to update the word vectors, or do they stay constant over time?
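
For context, a hedged observation about this codebase: word vectors are looked up from a plain word_vec dictionary of numpy arrays when batches are built (see the get_batch calls in other issues on this page), so they enter the network as fixed inputs rather than trainable parameters. A quick check, assuming nli_net is a loaded NLINet:

print(nli_net.state_dict().keys())  # only LSTM and classifier weights; no embedding table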

RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)

I tried to run the train_nli.py file to train your model but got the following error.

Traceback (most recent call last):
  File "train_nli.py", line 283, in <module>
    train_acc = trainepoch(epoch)
  File "train_nli.py", line 176, in trainepoch
    output = nli_net((s1_batch, s1_len), (s2_batch, s2_len))
  File "/if5/wua4nw/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/if5/wua4nw/wasi/academic/research_with_prof_chang/fb_research_repos/InferSent/models.py", line 731, in forward
    u = self.encoder(s1)
  File "/if5/wua4nw/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/if5/wua4nw/wasi/academic/research_with_prof_chang/fb_research_repos/InferSent/models.py", line 44, in forward
    idx_sort = torch.cuda.LongTensor(idx_sort) if self.use_cuda else torch.LongTensor(idx_sort)
RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)

Any guess why I am getting this error? I am using Python 3.5; could that be the reason?

BLSTMEncoder retrieval issue

Got this error while trying to use the pretrained model for generating sentence embeddings on a local dataset:

/usr/local/lib/python2.7/dist-packages/torch/serialization.py:284: SourceChangeWarning: source code of class 'models.BLSTMEncoder' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
26126
Traceback (most recent call last):
  File "sentEmbed.py", line 18, in <module>
    embeddings = infersent.encode(sentences, bsize=128, tokenize=False, verbose=True)
  File "/home/ritvik/InferSent-master/encoder/models.py", line 198, in encode
    sentences, bsize, tokenize, verbose)
  File "/home/ritvik/InferSent-master/encoder/models.py",  line 175, in prepare_samples
    s_f = [word for word in sentences[i] if word in self.word_vec]
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 238, in __getattr__
    type(self).__name__, name))
AttributeError: 'BLSTMEncoder' object has no attribute 'word_vec'

I used the latest AllNLI pickle. A similar bug was present in the SentEval repository, but the fix there doesn't seem to apply here.

Error while running encoder/play.ipynb

When I run the following lines:

embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(embeddings)))

I get the following error:

Nb words kept : 128201/130068 (98.56 %)

RuntimeError Traceback (most recent call last)
in ()
----> 1 embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
2 print('nb sentences encoded : {0}'.format(len(embeddings)))

/home/leena/Downloads/InferSent-master/encoder/models.py in encode(self, sentences, bsize, tokenize, verbose)
177 if self.use_cuda:
178 batch = batch.cuda()
--> 179 batch = self.forward((batch, lengths[stidx:stidx + bsize])).data.cpu().numpy()
180 embeddings.append(batch)
181 embeddings = np.vstack(embeddings)

/home/leena/Downloads/InferSent-master/encoder/models.py in forward(self, sent_tuple)
48
49 # Un-sort by length
---> 50 idx_unsort = torch.from_numpy(idx_unsort).cuda() if self.use_cuda else torch.from_numpy(idx_sort)
51 sent_output = sent_output.index_select(1, Variable(idx_unsort))
52

RuntimeError: from_numpy expects an np.ndarray but got torch.LongTensor

Version details:
'3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'
In [ ]:
torch-0.1.12.post2

search of similar sentences

I have a scenario where I need to solve the following problem. It is not really an issue with this repo, but more a request for algorithmic help in using InferSent in a particular case.

So I have a set of n sentences for which I have created encodings using InferSent.

Now, given a query sentence, I want to find the sentences (among the n) most similar to it.

In what data structure should I arrange the n sentences so that I can quickly find the top k most similar ones?

More generally, how do I perform search over sentence encodings? (See the sketch below.)
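
A minimal sketch of the brute-force baseline, assuming corpus_vecs is the (n, 4096) array returned by model.encode on the n sentences; L2-normalizing once turns cosine similarity into a dot product, so a single matrix-vector product scores every sentence:

import numpy as np

corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

def top_k(model, query, k=10):
    q = model.encode([query], tokenize=True)[0]
    q = q / np.linalg.norm(q)
    scores = corpus_norm.dot(q)      # cosine similarity to every sentence
    return np.argsort(-scores)[:k]   # indices of the k best matches

This is often fast enough up to a few million sentences; beyond that, an approximate nearest-neighbour index (e.g. FAISS or Annoy) is the usual next step.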

CPU version?

Hello,

Is a CPU version of this code available or on the roadmap?
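
Not an official answer, but a hedged recipe that several issues on this page converge on: load the pickle onto CPU storage with map_location and disable the encoder's CUDA flag before encoding:

import torch

infersent = torch.load('infersent.allnli.pickle',
                       map_location=lambda storage, loc: storage)  # remap tensors to CPU
infersent.use_cuda = False  # keep the forward pass off cuDNN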

Missing evaluate_model.py script

Hey,

Readme mentions a evaluate_model.py script that is missing in this repo as well the senteval repo.

Please can you redirect me where to look for it?

Thanks in advance !

The model fails to infer simple temporal relations?

I am doing a project on temporal relation retrieval and want to see whether I can use the encoder here. However, a simple experiment suggests a fundamental issue. I gave four sentences in a list:
sentences = ['I went to school after eating breakfast.', 'I ate breakfast before going to school.', 'I went to school before eating breakfast.', 'I ate breakfast after going to school.']

After loading the model and encoding the sentences, I computed the cosine similarities (with the same function as in your demonstration):
cosine(sentences[0], sentences[1]) 0.890059
cosine(sentences[0], sentences[2]) 0.992262
cosine(sentences[0], sentences[3]) 0.899194
cosine(sentences[1], sentences[2]) 0.900831
cosine(sentences[1], sentences[3]) 0.989823
cosine(sentences[2], sentences[3]) 0.898861

Since sentences[0] and sentences[1] have the same meaning (in terms of temporal relations), as do sentences[2] and sentences[3], the scores do not make much sense.
I understand "semantics" can be a vague thing, and sometimes sentences[0] and sentences[2] may be grouped together for certain purposes, but in most everyday use of language that is not the case.

handling OOV entities

Great work so far on InferSent. I'm thinking of using InferSent to improve a dynamic topic modeling algorithm I have, using word importance to weight the model more heavily toward those words.

I'm wondering if you have good suggestions on ways to handle OOV entities without retraining on the (small) corpus.

Ideally, we could replace each entity with a simple Entity1, Entity2, etc., but I'm not sure the library would handle this well. We could naively replace each entity with an uncommon word, which seems to do a good job with the importance visualization, but would mess up any comparisons between the returned vectors.

Let me know if anything like this is possible!

Nick

CUDA-related error when trying to run with CPU

Hi,

I'm trying to run only on CPU. My PyTorch version is: torch (0.2.0.post2).
I used this line to initiate the model:
infersent = torch.load('infersent.allnli.pickle', map_location=lambda storage, loc: storage)

I got this warning:

SentEval/eval_models/models.py:54: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at
every call, possibly greately increasing memory usage. To compact weights again call flatten_parameters().
  sent_output = self.enc_lstm(sent_packed)[0]  # seqlen x batch x 2*nhid
And an assertion error when I call infersent.encode():
Traceback (most recent call last):
  File "infersent_run.py", line 155, in main
    results_transfer = se.eval(transfer_tasks)
  File "SentEval/senteval.py", line 56, in eval
    self.results = {x:self.eval(x) for x in name}
  File "SentEval/senteval.py", line 56, in <dictcomp>
    self.results = {x:self.eval(x) for x in name}
  File "SentEval/senteval.py", line 91, in eval
    self.results = self.evaluation.run(self.params, self.batcher)
  File "SentEval/binary.py", line 44, in run
    embeddings = batcher(params, batch)
  File "infersent_run.py", line 89, in batcher
    infersent_embed = params.infersent.encode(sentences, bsize=params.batch_size, tokenize=False)
  File "SentEval/eval_models/models.py", line 202, in encode
    batch = self.forward((batch, lengths[stidx:stidx + bsize]))
  File "SentEval/eval_models/models.py", line 54, in forward
    sent_output = self.enc_lstm(sent_packed)[0]  # seqlen x batch x 2*nhid
  File "/home/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/python2.7/site-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/python2.7/site-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/python2.7/site-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/python2.7/site-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/python2.7/site-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/python2.7/site-packages/torch/backends/cudnn/rnn.py", line 259, in forward
    _copyParams(weight, params)
  File "/home/python2.7/site-packages/torch/backends/cudnn/rnn.py", line 186, in _copyParams
    assert param_from.type() == param_to.type()
AssertionError

Any idea why this is happening, and why it is still calling cuDNN even though I want to run on CPU?

Parallel Model on 40 million sentences

I have a big corpus of roughly 40 million sentences. Is there any way to run this model in parallel? Or is it better to divide the corpus into chunks and eventually concatenate all the matrices together (see the sketch below)? Thanks!
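
A minimal sketch of the chunking approach, assuming `model` is a loaded encoder; each chunk is encoded independently, so chunks can also be farmed out to separate processes or GPUs and concatenated afterwards:

import numpy as np

def encode_in_chunks(model, sentences, chunk_size=100000):
    parts = []
    for i in range(0, len(sentences), chunk_size):
        parts.append(model.encode(sentences[i:i + chunk_size], bsize=128))
    return np.vstack(parts)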

Bug in InnerAttentionMILAEncoder

Hi,

Thanks for the great work! I'm using both InferSent and SentEval in my research and it's amazing how quickly I can test my models on so many tasks.

I'm getting this error when running with encoder_type=InnerAttentionMILAEncoder:

Traceback (most recent call last):
  File "train_nli.py", line 296, in <module>
    train_acc = trainepoch(epoch)
  File "train_nli.py", line 184, in trainepoch
    output = nli_net((s1_batch, s1_len), (s2_batch, s2_len))
  File "/users/oanuru/anaconda3/envs/infersent/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/dgx1/oanuru/InferSent/models.py", line 815, in forward
    u = self.encoder(s1)
  File "/users/oanuru/anaconda3/envs/infersent/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/dgx1/oanuru/InferSent/models.py", line 622, in forward
    alphas2.data[0, :, 0])), 1))
RuntimeError: dim out of range - got 1 but the tensor is only 1D

Unable to reproduce some of the results on transfer tasks

I am unable to reproduce some of the results on transfer tasks. Most results I get are a bit higher than the reported ones. I use the infersent.py script in the SentEval repo.

Setup
p2.xlarge AWS, K80 GPU, deep learning AMI template
CUDA 8
Numpy 1.13.1
scikit-learn 0.19.0
Pytorch 0.2.0.post3

Results vs. reported results:

***** Transfer task : MR *****
Dev acc : 81.8 Test acc : 81.6
Reported: 81.1

***** Transfer task : CR *****
Dev acc : 87.11 Test acc : 86.7
Reported: 86.3

***** Transfer task : SUBJ *****
Dev acc : 92.74 Test acc : 92.33
Reported: 92.4

***** Transfer task : MPQA *****
Dev acc : 90.73 Test acc : 90.63
Reported: 90.2

***** Transfer task : SST Binary *****
Dev acc : 83.72 Test acc : 85.01
Reported: 84.6

***** Transfer task : TREC *****
Dev acc : 84.3 Test acc : 88.2
Reported:

***** Transfer task : SICK-Relatedness *****
Test : Pearson 0.883687833603 Spearman 0.826056552356 MSE 0.223225237894
Reported:0.884

***** Transfer task : SICK-Entailment *****
Dev acc : 86.8 Test acc : 86.26
Reported:86.1

***** Transfer task : MRPC *****
Dev acc : 76.15 Test acc 76.52; Test F1 83.17
Reported:

***** Transfer task : STS14 *****
ALL (weighted average) : Pearson = 0.6917, Spearman = 0.6632
ALL (average) : Pearson = 0.6776, Spearman = 0.6516
Reported:.68/.65

Possible PyTorch version error

Hi,

I'm using PyTorch 0.2.0.
I trained a model using this code's default settings and just want to reload it and evaluate it on the validation and test sets.

However, I encountered this error:

togrep : ['--nlipath', 'dataset/SNLI/', '--n_epochs', '1', '--gpu_id', '2']

Namespace(batch_size=64, decay=0.99, dpout_fc=0.0, dpout_model=0.0, enc_lstm_dim=2048, encoder_type='BLSTMEncoder', fc_dim=512, gpu_id=2, lrshrink=5, max_norm=5.0, minlr=1e-05, n_classes=3, n_enc_layers=1, n_epochs=1, nlipath='dataset/SNLI/', nonlinear_fc=0, optimizer='sgd,lr=0.1', outputdir='savedir/', outputmodelname='model.pickle', pool_type='max', seed=1234)
** TRAIN DATA : Found 942069 pairs of train sentences.
** DEV DATA : Found 19657 pairs of dev sentences.
** TEST DATA : Found 19656 pairs of test sentences.
Found 102367(/123296) words with glove vectors
Vocab size : 102367
NLINet (
  (encoder): BLSTMEncoder (
    (enc_lstm): LSTM(300, 2048, bidirectional=True)
  )
  (classifier): Sequential (
    (0): Linear (16384 -> 512)
    (1): Linear (512 -> 512)
    (2): Linear (512 -> 3)
  )
)

TEST : Epoch 1
Traceback (most recent call last):
  File "train_nli.py", line 300, in <module>
    evaluate(0, 'test', True)
  File "train_nli.py", line 248, in evaluate
    output = nli_net((s1_batch, s1_len), (s2_batch, s2_len))
  File "/home/anie/miniconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() takes exactly 2 arguments (3 given)

Is this a PyTorch version-related error?
I didn't change any code except commenting out the training part.

STS Benchmark reproducibility

Hello,

I am working on a thesis about sentence embeddings and I would like to ask how exactly I could reproduce the reported results of InferSent on the STS Benchmark dataset (75.8 Pearson coef.). I tried simply downloading the test set, encoding each pair of sentences (lower-casing them), and computing cosine similarity between them, but I only managed to reach a Pearson coef. of 71.0.

I know the description says I can use the SentEval project to reproduce the results, but it computes loads of other tasks that I am not interested in. I also looked at the code of SentEval, and I am a bit confused about how exactly it preprocesses the data and computes the scores. From what I understood, it uses the STS dataset from 2014 (I tried that one too, but only got a Pearson coef. of 65.8). I would much rather reproduce the result myself.

So the question is: what is the easiest way to reproduce the results on the STS Benchmark without using the (for me, too complicated) SentEval module? Should I preprocess the sentences somehow?

Thanks in advance.
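
For concreteness, a minimal sketch of the naive protocol described above, assuming pairs holds (sentence1, sentence2, gold_score) tuples from the STS Benchmark test split; note that the published number comes from SentEval, whose tokenization and preprocessing may differ, which alone can move the Pearson score:

import numpy as np
from scipy.stats import pearsonr

def sts_pearson(model, pairs):
    preds, golds = [], []
    for s1, s2, gold in pairs:
        u = model.encode([s1], tokenize=True)[0]
        v = model.encode([s2], tokenize=True)[0]
        preds.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        golds.append(gold)
    return pearsonr(preds, golds)[0]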

fastText

Did you try using facebookresearch/fastText instead of GloVe? If so, what were the conclusions?
It would make sense that you did, given that it is also a Facebook Research project.

get_data tokenizer.sed issue

Archive: SNLI/snli_1.0.zip
creating: SNLI/snli_1.0/
inflating: SNLI/snli_1.0/.DS_Store
creating: SNLI/__MACOSX/
creating: SNLI/__MACOSX/snli_1.0/
inflating: SNLI/__MACOSX/snli_1.0/._.DS_Store
extracting: SNLI/snli_1.0/Icon
inflating: SNLI/__MACOSX/snli_1.0/._Icon
inflating: SNLI/snli_1.0/README.txt
inflating: SNLI/__MACOSX/snli_1.0/._README.txt
inflating: SNLI/snli_1.0/snli_1.0_dev.jsonl
inflating: SNLI/snli_1.0/snli_1.0_dev.txt
inflating: SNLI/snli_1.0/snli_1.0_test.jsonl
inflating: SNLI/snli_1.0/snli_1.0_test.txt
inflating: SNLI/snli_1.0/snli_1.0_train.jsonl
inflating: SNLI/snli_1.0/snli_1.0_train.txt
inflating: SNLI/__MACOSX/._snli_1.0
./get_data.bash: ./tokenizer.sed: /bin/sed: bad interpreter: No such file or directory
./get_data.bash: ./tokenizer.sed: /bin/sed: bad interpreter: No such file or directory
./get_data.bash: ./tokenizer.sed: /bin/sed: bad interpreter: No such file or directory
./get_data.bash: ./tokenizer.sed: /bin/sed: bad interpreter: No such file or directory
./get_data.bash: ./tokenizer.sed: /bin/sed: bad interpreter: No such file or directory
./get_data.bash: ./tokenizer.sed: /bin/sed: bad interpreter: No such file or directory

Using infersent in long texts

Hi, if I want to use InferSent to represent a long text composed of many sentences, what would you recommend? Using the whole text instead of its individual sentences, I ran out of GPU memory.
Thanks in advance.
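
A hedged sketch of the usual compromise: split the text into sentences, encode them in modest batches (avoiding one enormous pseudo-sentence that exhausts GPU memory), and average the rows:

import nltk
import numpy as np

# nltk.download('punkt') may be needed once for the sentence splitter

def encode_document(model, text):
    sents = nltk.sent_tokenize(text)
    embeddings = model.encode(sents, bsize=64, tokenize=True)
    return embeddings.mean(axis=0)  # one vector for the whole document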

Two pooling issues

Hi, thanks for sharing this code!
I noticed two issues with the current implementation of mean-/max-pooling over the BiLSTM.

  1. sent_len is not unsorted before being used for normalization.
    At line 46, sent_len is sorted from biggest to smallest and the input embeddings are adjusted accordingly. At line 61, the hidden states are rearranged into the original order, while sent_len is not. This can lead to incorrect normalization in mean-pooling.

  2. Padding is not handled before pooling. As a result, the encoded sentence, and thus the prediction, depends on the amount of padding. I'm not sure if this is by design or a mistake. I ran into a case where running in a batch vs. running on each example gives different predictions, as shown below. Note that this result might not be directly reproducible, since only the trained encoder is released and this example comes from an SNLI classifier I trained on top of it.

ss1 = [
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>']
]
ss2 = [
    ['<s>', 'A', 'man', 'is', 'repainting', 'a', '</s>'],
    ['<s>', 'man', 'is', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'is', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'is', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'shirt', '</s>'],
    ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'blue', '</s>'],
    ['<s>', 'A', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>'],
    ['<s>', 'A', 'man', 'wearing', 'a', 'blue', 'shirt', '</s>'],
    ['<s>', 'man', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>']
]

k = 6
ss1 = ss1[:k]
ss2 = ss2[:k]
model.eval()
s1, s1_len = get_batch(ss1, word_vec)
s2, s2_len = get_batch(ss2, word_vec)
s1 = Variable(s1.cuda())
s2 = Variable(s2.cuda())
p = torch.max(model((s1, s1_len), (s2, s2_len)), 1)[1].data.cpu().numpy()
for i in range(len(ss1)):
    b = (get_batch([ss1[i]], word_vec), get_batch([ss2[i]], word_vec))
    print(p[i], torch.max(forward(model, b), 1)[1].data.cpu().numpy()[0])

output:
1 0
1 1
1 1
1 1
1 1
0 0

The second issue might be related to Issue #48.

I made an attempt to fix these two issues in my pull request. With the pooling issues fixed, I trained an SNLI classifier from scratch. Performance increased a little on SNLI (dev 84.56, test 84.70) but decreased on almost all transfer tasks. Here are the numbers I got (Fork column):

Task              SkipThought  InferSent  Fork
MR                79.4         81.1       79.86
CR                83.1         86.3       83.16
SUBJ              93.7         92.4       92.45
MPQA              89.3         90.2       90.01
STS14             .44/.45      .68/.65    .65/.62
SICK Relatedness  0.858        0.884      0.877
SICK Entailment   79.5         86.1       85.45
SST2              82.9         84.6       81.77
SST5              -            -          44.03
TREC              88.4         88.2       85.4
MRPC              -            76.2/83.1  74.2/81.8

Time Consuming

Hi,
So I was testing the InferSent model with 20k sentences, and it took around two hours to produce results. Can you tell me why it is so time-consuming?
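
A hedged sanity check, since two hours for 20k sentences usually points at a CPU run or a tiny batch size:

embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
print(model.use_cuda)  # False means the LSTM ran on CPU, which is much slower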

AssertionError: Torch not compiled with CUDA enabled

Hi,
I was trying out the encoder as given in InferSent/encoder/demo.ipynb.
It works fine until I execute this line for encoding the sentences:

embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(embeddings)))

I am running it on CPU, but it raises the following error:

"AssertionError: Torch not compiled with CUDA enabled"
