Hi Oliver,
You have merged a few PRs but still have not issued a new version. Is there a chance of that in the near future? Or, if you are busy with other projects, could you add maintainers to your repo? I or the person who made the recent PR would be more than happy to contribute and help this package survive. @oborchers

dummy example doesn't work for me

Thanks for that package. Very much needed work :-)

gives: TypeError: s2v_train() takes 3 positional arguments but 4 were given

I was trying it on Google Colab

cannot import name 'BaseKeyedVectors' from 'gensim.models.keyedvectors'

I have installed gensim 3.8 and have python 3.7.


from fse.inputs import IndexedList, IndexedLineDocument
---> 12 from gensim.models.keyedvectors import BaseKeyedVectors
14 from numpy import dot, float32 as REAL, memmap as np_memmap, \

ImportError: cannot import name 'BaseKeyedVectors' from 'gensim.models.keyedvectors' (/usr/local/lib/python3.7/dist-packages/gensim/models/

I can't find any class called "BaseKeyedVectors" in gensim. It looks like it has been changed to just "KeyedVectors"?

Update to python<3.7 to fix "C extension not loaded, training/inferring will be slow. Install a C compiler and reinstall fse" warning

This is related to #18, which was closed but is not solved. Another user (@lucas-ubm) and I are experiencing this problem on macOS systems, so it is not limited to Windows. I have tried installing gensim through conda, to no avail. Any tips would be greatly appreciated.

Error message:

/opt/anaconda3/envs/sbir_covid/lib/python3.8/site-packages/fse/models/ UserWarning: C extension not loaded, training/inferring will be slow. Install a C compiler and reinstall fse.

Here is my machine setup:
macOS: MacBook Pro (15-inch, 2019), Version 11.2.3 (20D91)
Processor: 2.3 GHz 8-Core Intel Core i9
Memory: 32 GB 2400 MHz DDR4

Here is my conda env setup:

Hi, I am currently trying out your algorithm and I was wondering what speeds you achieve. On my machine (MacBook Pro), training on 200 sentences takes roughly 3 seconds. Is this normal or do you think there is something wrong? Your help would be much appreciated!

question for output

Dear @oborchers

I'm investigating sentence vectors using your gensim example (data and GloVe).

When I check the similarity of sentence s[0],
I get a good result:

[(10, 1.0),
 (2, 1.0),
 (4, 1.0),
 (14, 1.0),
 (6, 1.0),
 (8, 1.0),
 (12, 1.0),
 (15, 0.9294594526290894),
 (13, 0.9294594526290894),
 (1, 0.9294594526290894)]
0 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 10)
1 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 2)
2 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 4)
3 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 14)
4 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 6)
5 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 8)
6 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 12)

and the similarity check works well:

0 10 similarity :  1.0
0 2 similarity :  1.0
0 4 similarity :  1.0
0 14 similarity :  1.0
0 6 similarity :  1.0
0 8 similarity :  1.0
0 12 similarity :  1.0

However, when I check s[100], the result is wrong. s[100] and s[102] are the same sentence:

100 (['Should', 'I', 'buy', 'tiago?'], 100)
102 (['Should', 'I', 'buy', 'tiago?'], 102)

but the result is different:
[(3949083, 1.0),
 (897678, 1.0),
 (4229890, 1.0),
 (3949079, 1.0),
 (3949081, 1.0),
 (2934317, 1.0),
 (4093542, 1.0),
 (3949075, 1.0),
 (4229889, 1.0),
 (2934319, 1.0)]
3949083 (['Should', 'I', 'buy', 'Asus', 'Zenfone', '5?'], 3949083)
897678 (['Why', "doesn't", 'Google', 'buy', 'Quora?'], 897678)

Do you have any idea why?

Paranmt Model

Is there any information on getting the paranmt model or setting it up? The benchmarks show it as a great model to use with FSE and I was hoping to try it out, but I haven't been able to find it anywhere (just training data). I was just curious if there was somewhere we could access this model/keyed vectors.

Infer only returns embedding of one sentence

Given a list of input tuples of the form Tuple[List[str], int], I initially expected infer to return a numpy matrix of size (n, vector_size).
I suspect this is due to the following line:

output = zeros((statistics["max_index"], self.sv.vector_size), dtype=REAL)

Should it be something like this?

output = zeros((statistics["total_sentences"], self.sv.vector_size), dtype=REAL)

Reproducible example from the tutorial:

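import gensim.downloader as api
from fse import IndexedList
from fse.models import SIF

data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(8):
        # Loop body restored from the tutorial; the field names are assumed.
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())
s = IndexedList(sentences)

model = SIF(glove, workers=2)
model.train(s)  # assumed step: infer needs a trained model
tmp = ("Hello my friends".split(), 0)
model.infer([tmp, tmp])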

AttributeError: 'Word2Vec' object has no attribute 'infer'

Hi, when I run the line of code below I get the following error. Could you please advise?
I am using a word2vec model trained with the gensim package.

model.sv.similar_by_sentence("Is this really easy to learn".split(), model=wmodel, indexable=s.items)

File "C:\ProgramData\Anaconda3\lib\site-packages\fse\models\", line 347, in similar_by_sentence
vector = model.infer([(sentence, 0)])

AttributeError: 'Word2Vec' object has no attribute 'infer'

Add Features to Sentencevectors

[ ] Sentencevectors:
[ ] Remove normalized vector files and replace with NN
ANN: --> (Annoy, with Option for Google ScANN?)
[ ] Only construct index when calling most_similar method
[ ] Logging of index speed
[ ] Save and load of index
[ ] Assert that index and vectors are of equal size
[ ] Parameters must be tunable afterwards
[ ] Method to reconstruct index
[ ] How does the index saving comply with SaveLoad?
[ ] Write unittests?
[ ] Keep access to default method
[ ] Make ANN Search the default?! --> Results?
[ ] Throw warning for large datasets for vector norm init
[ ] Maybe throw warning if exceeds RAM size of the embedding + normalization
[ ] L2 Distance
[ ] L1 Distance
[ ] Correlation (Power Score Correlation?)
[ ] Lookup-Functionality (via defaultdict)
[ ] Get vector: Not really memory friendly
[ ] Show which words are in vocabulary
[ ] Assess empty vectors (via EPS sum)
[ ] Z-Score Transformation from Power-Means Embedding? --> Benefit?

uSIF model

I got a NaN values error with the uSIF model.

Returning vectors with similarity above threshold for most_similar()

most_similar() can return the topn most similar sentences. However, it would be useful to be able to specify a similarity threshold above which sentences are returned. For this, topn could accept a fractional value: if topn is strictly smaller than 1, it is treated as a threshold; otherwise it works the same way as it does now.
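
To make the proposal concrete, here is a minimal sketch of the topn-as-threshold behaviour. It is plain Python and not fse code: the function name, the list-of-lists input, and the strict "greater than" comparison are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    topn: float = 10,
) -> List[Tuple[int, float]]:
    """Hypothetical helper, not the fse API.

    topn >= 1     -> return the topn best matches (current behaviour).
    0 < topn < 1  -> treat topn as a threshold and return every match
                     whose similarity is strictly above it.
    """
    scored = sorted(
        ((idx, cosine(query, vec)) for idx, vec in enumerate(vectors)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if 0 < topn < 1:
        return [(idx, score) for idx, score in scored if score > topn]
    return scored[: int(topn)]
```

With this shape, most_similar(q, vecs, topn=0.8) returns a variable number of hits, while most_similar(q, vecs, topn=5) keeps today's fixed-size result.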

Ordering of sentences trained on matters for the inferred vectors.


First of all, thank you for a nice repository. I am however a bit troubled about one thing, which I hope to get answered here.

The order in which the data is input seems to matter for the resulting vectors, at least for the uSIF embedding function.

Consider the example below.

import numpy as np
import gensim.downloader as api
from fse.models import uSIF
from fse import IndexedList

def load_w2vec(vecs: str = "word2vec-google-news-300"):
    model = api.load(vecs)
    return model

glove = load_w2vec("glove-wiki-gigaword-100")
data = [["Hello", "there", "John"], ["Hi", "everyone", "good", "day"]]
input_1 = IndexedList(data)
model = uSIF(glove, lang_freq="en")
model.train(input_1)  # assumed step: infer needs a trained model
vecs = model.infer(input_1)

# Inferring the same input a second time gives identical vectors.
vecs2 = model.infer(input_1)

print(f"All vectors are the same: {np.all(vecs == vecs2)}")

# Feed the model the same data for training but in another order.
input_2 = IndexedList(data[::-1])

model = uSIF(glove, lang_freq="en")
model.train(input_2)  # assumed step: train on the reversed order
vecs2 = model.infer(input_1)  # Infer the same sentences in the original order.
print(f"All vectors are the same: {np.all(vecs == vecs2)}")

Gives me the output

All vectors are the same: True
All vectors are the same: False

Should this really be the case? Thank you in advance!

Docs: extend example with lookup

It would be great if you could show how to use the sentence embeddings, e.g. to find the nearest sentence, or nearest N sentences, in the training corpus. I.e. how to write find_nearest() in the below:

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)

from fse.models import Sentence2Vec
se = Sentence2Vec(model)
sentences_emb = se.train(sentences)

test_sentence = [["dog", "say", "moo"]]
find_nearest(sentences_emb, test_sentence)
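
For what it's worth, here is a minimal sketch of what such a find_nearest() could look like. It is plain Python over lists of floats and not the fse API. It assumes the test sentence has already been turned into a vector (query_emb), which is a different signature from the call above, and it ranks by cosine similarity.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def find_nearest(
    sentences_emb: Sequence[Sequence[float]],
    query_emb: Sequence[float],
    topn: int = 1,
) -> List[Tuple[int, float]]:
    """Hypothetical helper: return (index, cosine similarity) for the topn
    training sentences closest to the query embedding."""
    query = _unit(query_emb)
    scores = (
        (sum(a * b for a, b in zip(_unit(emb), query)), idx)
        for idx, emb in enumerate(sentences_emb)
    )
    return [(idx, score) for score, idx in heapq.nlargest(topn, scores)]
```

The returned indices can then be used to look up the original sentences in the training corpus. For large corpora a vectorised or approximate nearest-neighbour search would be the more realistic choice; this loop only illustrates the idea.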

ImportError: cannot import name '_l2_norm' from 'gensim.models.keyedvectors'

ImportError Traceback (most recent call last)

in ()
----> 1 from fse import Vectors, Average, IndexedList
2 vecs = Vectors.from_pretrained("fasttext-wiki-news-subwords-300")
3 model = Average(vecs)

3 frames

/usr/local/lib/python3.7/dist-packages/fse/models/ in ()
41 from gensim.models.base_any2vec import BaseWordEmbeddingsModel
---> 42 from gensim.models.keyedvectors import BaseKeyedVectors, FastTextKeyedVectors, _l2_norm
43 from gensim.utils import SaveLoad
44 from gensim.matutils import zeros_aligned

ImportError: cannot import name '_l2_norm' from 'gensim.models.keyedvectors' (/usr/local/lib/python3.7/dist-packages/gensim/models/

issue with fasttext model

The following code throws an error (TypeError: Cannot convert numpy.float32 to numpy.ndarray):

fb = load_facebook_model(path_to_model)
model = SIF(fb, alpha=1e-7, components=1)
model.train([IndexedSentence(s, i) for i, s in enumerate(sentences)])
model.sv.similar_by_sentence(['документы', 'бухгалтерия'], model=model, indexable=sentences)  # this line raises the error

However, if we replace the model with vectors, everything seems fine.

ft = KeyedVectors.load_word2vec_format(path_to_vectors)
model = SIF(ft, alpha=1e-7, components=1)
model.train([IndexedSentence(s, i) for i, s in enumerate(sentences)])
model.sv.similar_by_sentence(['документы', 'бухгалтерия'], model=model, indexable=sentences)

This problem is really important since word counts (ft.wv.vocab) from vectors look like they were automatically recovered from vectors using cosine similarity (not sure about that) and they are not the same as from the model.

Regarding comparison

Hi, Really great work!!

I just have one question.

Which one (FastText, word2vec, GloVe) gives the best sentence embeddings when the respective word vectors are averaged?

Which one, in your opinion, will give better search results for sentences if I embed them?

uSIF does not work with small data?

I'm trying to test uSIF, but I get an error in the SVD part about NaN values in the vector.
I took the Average example and changed it to uSIF:

from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, size=10)

from fse.models import uSIF
from fse import IndexedList
model = uSIF(ft, components=1)

The following error occurs during the fit() of the TruncatedSVD:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I'm using Python 3.6.5 :: Anaconda, Inc.

Optimize for Fasttext supervised mode

Hi @oborchers ,
first, kudos for the superb work! 👍
I am exploring SIF embeddings with your library. I have noticed, looking at the code, that there are some specific optimizations for FastText models. Now, since Gensim does not seem to support supervised FT models in any way, I currently "force" it to load my model through the .vec files (instead of the .bin), like this:

ft = KeyedVectors.load_word2vec_format("model_ft.vec")

which means that the model that I load is not an instance of FastTextKeyedVectors, and thus I will not exploit any of these optimizations when I train the SIF model.
Indeed, the resulting model is quite heavy on the RAM usage, so I am wondering if there is any better way to do this, also considering that the only function that I will call after training will be infer.

Also, can the memmap settings be changed later? I am thinking of something like training in RAM, writing the whole SIF model to disk, and then loading it later while keeping the word vectors on disk (or changing wv_mapfile_path to another location, for example when moving to another machine).

Any kind of hint would be highly appreciated! And thanks again 😃

use normalized word vectors or not?

Hi, I am using a word2vec model to calculate sentence-level embeddings, and I was wondering whether I should use normalized word vectors to train the SIF model.

Gensim version ImportError: cannot import name 'BaseKeyedVectors'

Dear fse creator,

The import below raises ImportError: cannot import name 'BaseKeyedVectors'.
from fse import SplitIndexedList

We think it is a gensim compatibility issue, so we were wondering which gensim version we should use.
(We are using the latest gensim, 4.0.0.)

This GitHub issue suggests it should be from gensim.models.keyedvectors import KeyedVectors

-- Luke
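
In case it helps others hitting the same ImportError, here is a small sketch for checking the installed gensim version before importing fse. It only relies on the reports in this thread (the import fails on gensim 4.0.0); the helper names are made up for illustration, and importlib.metadata needs Python 3.8 or newer.

```python
from importlib import metadata
from typing import Optional


def lacks_base_keyed_vectors(version: str) -> bool:
    """Assumption based on the reports above: gensim 4.x no longer exposes
    BaseKeyedVectors, while 3.x releases still do."""
    major = int(version.split(".")[0])
    return major >= 4


def installed_gensim_version() -> Optional[str]:
    """Return the installed gensim version string, or None if it is missing."""
    try:
        return metadata.version("gensim")
    except metadata.PackageNotFoundError:
        return None


version = installed_gensim_version()
if version is None:
    print("gensim is not installed")
elif lacks_base_keyed_vectors(version):
    print(f"gensim {version}: importing this fse release will likely fail")
else:
    print(f"gensim {version}: BaseKeyedVectors should be importable")
```

If the check reports a 4.x version, installing a 3.x release of gensim is the workaround that matches the reports here, until fse itself is updated.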

Reduce Model Size for Inference

Is it possible to reduce model size by discarding previously calculated sentence vectors? This would be for downstream inference-only uses.


train on data and predict on new data

Hi, I see that the train method of an fse object returns the sentence embedding.
Is there a predict method to apply the trained model to new data?
Or does train also serve as predict?


save file is very large?

My matrix is just (758194, 100), but the saved file is 15G. When I save an (800000, 100) matrix to an npy file, it is just 600 MB. Is this normal? I trained the SIF model on 30 million sentences.

-rw-r--r-- 1 ke ke  43M oct 11 19:09 sif_model
-rw-r--r-- 1 ke ke  15G oct 11 19:09  <<----- this file very large
-rw-r--r-- 1 ke ke 290M oct 11 19:07 sif_model.wv.vectors.npy
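
A quick back-of-the-envelope check of the sizes involved, as a sketch. It assumes dense matrices with 4 bytes per value (float32) or 8 bytes (float64), and it assumes, without confirmation from the listing, that the large file stores one 100-dimensional vector per training sentence.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def matrix_bytes(rows: int, cols: int, itemsize: int = 4) -> int:
    """Raw size of a dense rows x cols matrix (itemsize=4 means float32)."""
    return rows * cols * itemsize


# Word vectors: 758194 x 100 float32, close to the 290M .npy file above.
word_vectors = matrix_bytes(758194, 100)          # 303277600 bytes
print(f"word vectors: {word_vectors / MIB:.1f} MiB")

# The comparison matrix: 800000 x 100 is about 600 MB only as float64.
comparison = matrix_bytes(800000, 100, itemsize=8)  # 640000000 bytes
print(f"comparison matrix: {comparison / MIB:.1f} MiB")

# Assumed: one 100-d float32 vector for each of ~30 million sentences.
sentence_vectors = matrix_bytes(30_000_000, 100)  # 12000000000 bytes
print(f"sentence vectors: {sentence_vectors / GIB:.1f} GiB")
```

Under those assumptions, a file in the 10+ GB range is the expected order of magnitude for 30 million sentences, because the sentence matrix has far more rows than the word matrix.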

C extension not loaded, training/inferring will be slow

Hi, I've installed fse on Windows and I get the following warning:
C:\Users\CARLOS\Anaconda3\lib\site-packages\fse\models\ UserWarning: C extension not loaded, training/inferring will be slow. Install a C compiler and reinstall fse.
"C extension not loaded, training/inferring will be slow. "

I know that this type of warning also appears in gensim in some cases. However, in gensim I don't have this problem. Does anyone have any idea what I should do?

My results on gensim

from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

My training speed is :

training on 571201 effective sentences with 4590776 effective words took 28s with 19703 sentences/s

Rework Threading Input class

Rework the threading (in my last experience, at least, the input thread is the bottleneck, not the actual computation).

Encounter "Divided 0 Error"

Hi Oliver,

Thanks for this great repo!
However, I ran into an issue when using it with SentEval.

In short, I get a "divide by zero" error when I use uSIF(glove, length=12, ncomponents=1). The error is raised when calculating a = (1 - alpha)/(alpha * 2) in fse/models/

If you need a minimal example to reproduce the error, please let me know.
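
To illustrate where the error comes from, here is a sketch of the quoted formula with an explicit guard. The function name is hypothetical and this is not the fse implementation; how alpha is computed is not shown in this issue, so the sketch only covers the case where it ends up as zero.

```python
def usif_a(alpha: float) -> float:
    """Compute a = (1 - alpha) / (alpha * 2), the formula quoted above.

    Hypothetical guard: fail with a descriptive message instead of a bare
    ZeroDivisionError when alpha is zero (or negative).
    """
    if alpha <= 0.0:
        raise ValueError(
            f"alpha must be positive to compute a, got {alpha!r}; "
            "check the word frequencies and the length setting"
        )
    return (1 - alpha) / (alpha * 2)
```

A guard like this does not fix the underlying cause, but it would point directly at alpha rather than at the division.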

Hierarchical pooling

Could you say something more about hierarchical pooling?
I am interested in this feature, but I'm not sure what you mean.
I can try to implement this if given some guidance.

RuntimeError: You must first train the model to obtain SVD components


Model training works without problems, and after saving my model I can load it.

from fse.models import SIF
model = SIF(w2v, workers=8)

But when I try to run
similars = model.sv.similar_by_sentence("my sentence".split(), model=model, indexable=doc.items, topn=100)
I get the error in the title.


GENSIM KeyedVectors and downloadable Models

It appears that when I download any model from the downloader API in gensim, or save a Word2Vec model and reload it in KeyedVectors format, the vocab object stores a reverse index in the "count" variable. So, for example, if I have 10 words in the model, the first word has a count of 10 and an index of 0.

Using the following code:

word_vectors = api.load('glove-wiki-gigaword-100')
sif_model = uSIF(model=word_vectors)

The word_vectors.wv.vocab shows the first word to be:
"the" and the count = 400000 and the index = 0
For each succeeding word in the model the count goes down by one, and the index goes up by 1.

Clearly this is not the frequency information.
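
For anyone who wants to test their own vectors for this pattern, here is a small sketch. It works on plain (count, index) pairs rather than on gensim objects, so extracting those pairs from the vocab is left out, and the function name is made up for illustration.

```python
from typing import Iterable, Tuple


def looks_like_rank_counts(entries: Iterable[Tuple[int, int]]) -> bool:
    """Return True if every (count, index) pair satisfies
    count == vocab_size - index, i.e. the "count" is only a reversed rank
    and carries no real frequency information."""
    pairs = list(entries)
    vocab_size = len(pairs)
    return vocab_size > 0 and all(
        count == vocab_size - index for count, index in pairs
    )
```

For the model described above, the first entry would be (400000, 0), the next (399999, 1), and so on, and the function returns True for such a vocabulary. A vocabulary with real corpus frequencies should return False.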

I took this example from your Jupyter notebook, so I assume that something has changed in the models themselves? Any guidance on this would be helpful. I CAN create my own word2vec models; they have the frequency values as expected, and the precalculation works as expected.

Thanks for any thoughts or guidance on this. Perhaps this is normal that none of these models retain the word frequencies.


Michael Wade

slow speed for SIF model for large corpus

I have been experimenting with fse. For a small dataset of 200-300k sentences, embedding generation was very fast. But now I am training on a large corpus of 50 million sentences. I am using 12 workers and training is still very slow: according to the logs, around 700 sentences/sec. I am using gensim.models.FastText.
I also got the user warning "C extension not loaded, training/inferring will be slow. " on Ubuntu 16.04. Is there any way to increase the speed?
Thank you

error: (lwork>=n||lwork==-1) failed for 1st keyword lwork: sorgqr:lwork=-980116480

2019-10-04 12:19:33,452 : MainThread : INFO : worker thread finished; awaiting finish of 1 more threads
2019-10-04 12:19:33,452 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
error                                     Traceback (most recent call last)
<ipython-input-11-7bb656653df0> in <module>
----> 1 model.train(doc)

~/anaconda3/lib/python3.6/site-packages/fse/models/ in train(self, sentences, update, queue_factor, report_delay)
    641         # Preform post-tain calls (i.e principal component removal)
--> 642         self._post_train_calls()
    644         self._log_train_end(eff_sentences=eff_sentences, eff_words=eff_words, overall_time=overall_time)

~/anaconda3/lib/python3.6/site-packages/fse/models/ in _post_train_calls(self)
     79         """ Function calls to perform after training, such as computing eigenvectors """
     80         if self.components > 0:
---> 81             self.svd_res = compute_principal_components(, components=self.components)
     82             self.svd_weights = (self.svd_res[0] ** 2) / (self.svd_res[0] ** 2).sum().astype(REAL)
     83             remove_principal_components(, svd_res=self.svd_res, weights=self.svd_weights, inplace=True)

~/anaconda3/lib/python3.6/site-packages/fse/models/ in compute_principal_components(vectors, components)
     32     start = time()
     33     svd = TruncatedSVD(n_components=components, n_iter=7, random_state=42, algorithm="randomized")
---> 34
     35     elapsed = time()
     36"computing {components} principal components took {int(elapsed-start)}s")

~/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/ in fit(self, X, y)
    139             Returns the transformer object.
    140         """
--> 141         self.fit_transform(X)
    142         return self

~/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/ in fit_transform(self, X, y)
    176             U, Sigma, VT = randomized_svd(X, self.n_components,
    177                                           n_iter=self.n_iter,
--> 178                                           random_state=random_state)
    179         else:
    180             raise ValueError("unknown algorithm %r" % self.algorithm)

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/ in randomized_svd(M, n_components, n_oversamples, n_iter, power_iteration_normalizer, transpose, flip_sign, random_state)
    333     Q = randomized_range_finder(M, n_random, n_iter,
--> 334                                 power_iteration_normalizer, random_state)
    336     # project M to the (k + p) dimensional space using the basis vectors

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/ in randomized_range_finder(A, size, n_iter, power_iteration_normalizer, random_state)
    224     # Sample the range of A using by linear projection of Q
    225     # Extract an orthonormal basis
--> 226     Q, _ = linalg.qr(safe_sparse_dot(A, Q), mode='economic')
    227     return Q

~/anaconda3/lib/python3.6/site-packages/scipy/linalg/ in qr(a, overwrite_a, lwork, mode, pivoting, check_finite)
    163     elif mode == 'economic':
    164         Q, = safecall(gor_un_gqr, "gorgqr/gungqr", qr, tau, lwork=lwork,
--> 165                       overwrite_a=1)
    166     else:
    167         t = qr.dtype.char

~/anaconda3/lib/python3.6/site-packages/scipy/linalg/ in safecall(f, name, *args, **kwargs)
     19         ret = f(*args, **kwargs)
     20         kwargs['lwork'] = ret[-2][0].real.astype(
---> 21     ret = f(*args, **kwargs)
     22     if ret[-1] < 0:
     23         raise ValueError("illegal value in %d-th argument of internal %s"

error: (lwork>=n||lwork==-1) failed for 1st keyword lwork: sorgqr:lwork=-980116480

When executing

from gensim.models.keyedvectors import FastTextKeyedVectors as kv
from fse.models import uSIF
from fse import IndexedLineDocument
ft = kv.load("<path to pretrained fasttext>") 

from fse.models.average import FAST_VERSION, MAX_WORDS_IN_BATCH 

doc = IndexedLineDocument("<very large list of sentences.txt>") 
model = uSIF(ft, workers=28, sv_mapfile_path="../tmp/sv_map", wv_mapfile_path="../tmp/wv_map") 


Note that it also happens with regular SIF. The list of sentences is approx. 30GB, and the embedding dimension is 128. I am not entirely sure how to debug this further. Any thoughts?
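
One possible reading of the negative lwork value, offered as a guess rather than a diagnosis: a workspace size that no longer fits in a signed 32-bit integer wraps around to a negative number. The sketch below only demonstrates that arithmetic; the helper name is made up, and nothing in the traceback confirms which value was actually requested.

```python
INT32_MAX = 2 ** 31 - 1


def wrap_int32(value: int) -> int:
    """Interpret an integer as a signed 32-bit value (two's complement)."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31


reported = -980116480           # lwork from the error message
candidate = reported + 2 ** 32  # smallest positive value that wraps to it

print(candidate)                # 3314850816
print(candidate > INT32_MAX)    # True: does not fit in a signed int32
print(wrap_int32(candidate) == reported)  # True
```

If this is what happens, the size is simply too large for the 32-bit integer, and computing the principal components on a smaller sample of the sentence vectors might sidestep it. That workaround is an untested assumption on my part.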

List of dependencies from Anaconda

Handling out of vocabulary


I am using this package to compile reasonable word vectors, but for some short compilations of words, all my words are OOV. I tried using FastText, but I get:

*** RuntimeError: Model must be child of BaseWordEmbeddingsModel or BaseKeyedVectors. Received FastText(vocab=2519370, size=300, alpha=0.025)

Is it possible to use FastText and handle Out of vocabulary words?

Thank you!

Method save works (somehow) but load does not

Hi, when I call the .save method on a SIF model it works, although as I understand it the only way to save/serialize a model to disk is by using pickle?

model.save("model_sif2")

Trying to use save should return an error, the same as when I try to load the saved model:

model_sif2= FT_gensim.load("model_sif2")

AttributeError: Can't get attribute 'FastTextKeyedVectors' on <module 'gensim.models.deprecated.keyedvectors' from '/j/miniconda3/envs/clean_unsup/lib/python3.7/site-packages/gensim/models/deprecated/'>

Does FSE guarantee ordering of vectors to be that of the input sentences?

For an example like:

import pandas as pd

from fse.models import uSIF
from fse import SplitIndexedList
from gensim.models.keyedvectors import FastTextKeyedVectors

fasttext_model_path = "models/fasttext-wiki-news-subwords-300.model"
ft = FastTextKeyedVectors.load(fasttext_model_path)

sent_fp = "data/sentences/sentences.csv.gz"
df = pd.read_csv(sent_fp)

sentences = df.sentence.values

indexed_sentences = SplitIndexedList(sentences)

model = uSIF(ft, workers=2, lang_freq="en")

sentence_count, word_count = model.train(indexed_sentences)

embeddings =

Where I read in an ordered list of sentences and then process them through a pre-trained model, does FSE guarantee the order of the model vectors to be the same order that the sentences were fed in?

I didn't see anything in the documentation or source code to suggest they wouldn't be, but I also haven't seen in the documentation any claims for guaranteed ordering either.


Don't absorb KeyedVectors into BaseS2V class

Untangle the bad design decision of storing the BaseKeyedVectors from Gensim internally. If users want mmap, they can just load that and pass it. At the very least, we shouldn't store it with the model.

