Comments (3)
Thank you for your quick reply! I am just wondering what this implies for my workflow. I have a very large collection of documents (4.5 million), each with a moderate number of sentences (200 to 250). Should I batch them into large sentence collections for training and then do inference per document afterwards?
from fast_sentence_embeddings.
Hi @julianlanger! Thank you for your feedback! I have uploaded a benchmark to the development branch. What currently takes long (about 2 s) is the induction of word frequencies.
This is a step you only have to perform once when you are using SIF or uSIF embeddings. If you just compute average embeddings, you can drop lang_freq="en" and it will take only a few ms.
Actually, I found a small bug in the Fasttext implementation thanks to your request. Thank you.
If you have further questions, feel free to ask.
I have not yet had to work with this kind of data, although I have thought of this as a feature. I guess the fastest way would be to work with one large sentence collection and keep a separate mapping from each sentence index to a document id or, more simply, from each document id to a tuple of indices (lo, hi), where lo is the overall index of the first sentence in the document and hi marks where the document ends (assuming you stored the sentences in document order).
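The (lo, hi) mapping described above can be sketched in plain Python. This is only an illustration, not part of the library: the helper names flatten_documents and rows_for_document are hypothetical, and hi is assumed to be exclusive (one past the last sentence of the document), which the comment does not specify.

```python
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def flatten_documents(
    documents: Iterable[Tuple[Hashable, Sequence[str]]],
) -> Tuple[List[str], Dict[Hashable, Tuple[int, int]]]:
    """Flatten (doc_id, sentences) pairs into one sentence list.

    Returns the flat list plus a dict mapping doc_id -> (lo, hi), where lo is
    the overall index of the document's first sentence and hi is exclusive
    (an assumption made for this sketch).
    """
    sentences: List[str] = []
    doc_ranges: Dict[Hashable, Tuple[int, int]] = {}
    for doc_id, doc_sentences in documents:
        lo = len(sentences)          # index of this document's first sentence
        sentences.extend(doc_sentences)
        hi = len(sentences)          # one past this document's last sentence
        doc_ranges[doc_id] = (lo, hi)
    return sentences, doc_ranges


def rows_for_document(rows: Sequence, doc_range: Tuple[int, int]) -> list:
    """Slice the rows (sentences or their vectors) that belong to one document."""
    lo, hi = doc_range
    return list(rows[lo:hi])


# Toy data standing in for the real collection.
docs = [
    ("doc-a", ["first sentence", "second sentence"]),
    ("doc-b", ["only sentence"]),
]
all_sentences, ranges = flatten_documents(docs)
doc_b_sentences = rows_for_document(all_sentences, ranges["doc-b"])
```

Since sentence vectors would be stored in the same order as the flat sentence list, the same (lo, hi) tuple could be used to slice a document's vectors after training.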