
error: (lwork>=n||lwork==-1) failed for 1st keyword lwork: sorgqr:lwork=-980116480 about fast_sentence_embeddings HOT 3 CLOSED

oborchers commented on May 26, 2024 1

error: (lwork>=n||lwork==-1) failed for 1st keyword lwork: sorgqr:lwork=-980116480

from fast_sentence_embeddings.

Comments (3)

oborchers commented on May 26, 2024 1

As far as I can tell, this might be related to the SVD computation: 30 GB of sentences might be too much for the SVD solver. It should nonetheless be possible to approximate the SVD components by using only a subset of the sentences. I'll have to dig into this.
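One plausible reading of the negative `lwork` in the error message, offered as an assumption rather than a confirmed diagnosis, is that the requested LAPACK workspace size exceeded the range of a signed 32-bit integer and wrapped around. The helper `wrap_int32` below is hypothetical and only illustrates the arithmetic:

```python
def wrap_int32(value: int) -> int:
    """Interpret an integer as a signed 32-bit value (two's complement wraparound)."""
    return (value + 2**31) % 2**32 - 2**31


# Value reported in the error message.
reported_lwork = -980116480

# Smallest non-negative size that would wrap to the reported value.
# This is an illustration, not the size LAPACK actually requested.
candidate = reported_lwork + 2**32

print(candidate)
print(wrap_int32(candidate) == reported_lwork)
```

If that reading is right, the failure would depend on the matrix dimensions rather than on free memory.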

Btw: for that amount of text, the lib will very likely need an approximate nearest neighbor search for similar sentences. I'm looking at Annoy.


joelkuiper commented on May 26, 2024

Yeah, I'm thinking the SVD is crashing. The machine had approx. 100 GB of memory left, though (since the rest was memory mapped). I am not sure if there is a nice iterative version of SVD; I'm guessing no. But for such a large set of documents, taking a (random) subset to approximate it might be valid, as you proposed.
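On the question of an iterative variant: one option, sketched here with NumPy only and not taken from the library, is to accumulate the small `dim x dim` matrix `X^T X` batch by batch and take its leading eigenvectors, which are the right singular vectors of `X`. The function name `top_components_streaming` is hypothetical. This route squares the condition number, so it is less accurate than a direct SVD for ill-conditioned data.

```python
import numpy as np


def top_components_streaming(batches, n_components=1):
    """Estimate the top right singular vectors from row batches via X^T X."""
    gram = None
    for batch in batches:
        batch = np.asarray(batch, dtype=np.float64)
        if gram is None:
            gram = np.zeros((batch.shape[1], batch.shape[1]))
        # Only a (dim, dim) matrix is kept in memory, never the full data.
        gram += batch.T @ batch
    if gram is None:
        raise ValueError("no batches supplied")
    # eigh returns eigenvalues in ascending order; pick the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order].T  # shape: (n_components, dim)


# Small synthetic check against a direct SVD.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))
X[:, 0] *= 5  # make one direction clearly dominant

streamed = top_components_streaming(np.array_split(X, 10), n_components=1)
_, _, vt = np.linalg.svd(X, full_matrices=False)

# Singular vectors are only defined up to sign, so compare |cosine|.
agreement = abs(float(streamed[0] @ vt[0]))
print(agreement)
```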

And yeah, for nearest neighbor lookup it definitely needs an approximate kNN. Apart from Annoy, there are also https://github.com/Microsoft/SPTAG and https://github.com/facebookresearch/faiss . Annoy is nice and simple, but I found it to be very finicky in terms of the number of trees used.
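To make the idea of approximate kNN concrete without depending on any of the libraries above, here is a minimal random-hyperplane LSH sketch in NumPy. It is a swapped-in toy technique, not how Annoy, SPTAG, or faiss work internally, and the class name `HyperplaneLSH` and its parameters are hypothetical.

```python
from collections import defaultdict

import numpy as np


class HyperplaneLSH:
    """Toy approximate cosine kNN: hash by sign pattern, then rank within the bucket."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)
        self.vectors = None

    def _key(self, vec):
        # The side of each hyperplane the vector falls on forms the hash key.
        return (self.planes @ vec > 0).tobytes()

    def fit(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float64)
        for idx, vec in enumerate(self.vectors):
            self.buckets[self._key(vec)].append(idx)
        return self

    def query(self, vec, k=5):
        vec = np.asarray(vec, dtype=np.float64)
        candidates = self.buckets.get(self._key(vec), [])
        if not candidates:
            return []
        cand = self.vectors[candidates]
        norms = np.linalg.norm(cand, axis=1) * np.linalg.norm(vec) + 1e-12
        sims = cand @ vec / norms
        best = np.argsort(-sims)[:k]
        return [candidates[i] for i in best]


rng = np.random.default_rng(1)
sentence_vectors = rng.standard_normal((200, 16))
index = HyperplaneLSH(dim=16, n_bits=4).fit(sentence_vectors)
neighbors = index.query(sentence_vectors[42], k=3)
print(neighbors)
```

Only vectors in the query's bucket are scored, so true neighbors in other buckets are missed. More bits give smaller buckets and faster queries at lower recall, a speed versus accuracy trade-off of the same general kind as tuning the number of trees.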


oborchers commented on May 26, 2024

@joelkuiper

I've included a solution to the problem! SIF and uSIF now come with a parameter "cache_size_gb", which determines the amount of RAM to reserve for the SVD computation. The default is 1 GB, so the SVD routine will randomly sample rows from the matrix if the matrix is larger than 1 GB. I've pushed this change to the development branch.
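The row sampling described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code on the development branch: only the parameter name `cache_size_gb` comes from this thread, while the function names `sample_rows_for_svd` and `principal_components` and the use of `np.linalg.svd` are illustrative choices.

```python
import numpy as np


def sample_rows_for_svd(matrix, cache_size_gb=1.0, seed=0):
    """Return the matrix itself, or a random row subset that fits the byte budget."""
    matrix = np.asarray(matrix)
    budget_bytes = int(cache_size_gb * 1024**3)
    row_bytes = matrix.shape[1] * matrix.itemsize
    max_rows = max(1, budget_bytes // row_bytes)
    if matrix.shape[0] <= max_rows:
        return matrix
    rng = np.random.default_rng(seed)
    idx = rng.choice(matrix.shape[0], size=max_rows, replace=False)
    # Sorted indices keep reads sequential, which helps with memory-mapped input.
    return matrix[np.sort(idx)]


def principal_components(matrix, n_components=1, cache_size_gb=1.0):
    """Approximate the top right singular vectors from the sampled rows."""
    sample = sample_rows_for_svd(matrix, cache_size_gb)
    _, _, vt = np.linalg.svd(sample, full_matrices=False)
    return vt[:n_components]


# Tiny budget so that sampling actually kicks in on this small demo matrix.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 50)).astype(np.float32)
sample = sample_rows_for_svd(embeddings, cache_size_gb=1e-4)
components = principal_components(embeddings, n_components=1, cache_size_gb=1e-4)
print(sample.shape, components.shape)
```

The sample size is derived from the byte budget rather than fixed, so wider embeddings get fewer rows for the same `cache_size_gb`.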

As for the approximate NN search: it's on my list. Thank you for the suggestions. I want a lib that is easy to install via pip. Annoy is easy, yet I'll have to dig into this more thoroughly: https://github.com/erikbern/ann-benchmarks
