Comments (7)
OK, but you agree that you require documents for training to be longer (with 2 anchors, at least 4 x max_length) than what you actually support for inference! This might be a serious issue for practical use, at least in my case.
> (I think you are saying 4 because `num_anchors == 2` by default?)

Yes.

> This doesn't mean the model actually sees text of this length. It only ever sees text from token length `min_length` up to token length `max_length`. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.

Have only skimmed the paper, to be honest :)

> Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable.

Ok, will do.

> I would also check out the training notebook and the `preprocess_wikitext_103.py` scripts if you have not. They demonstrate the process of calculating `min_length` and then filtering WikiText103 by it to produce a subsetted corpus of 17,824 documents.

I have analyzed exactly this script, just to see how much preprocessing is required, which is not much, fortunately. WikiText documents are "huge"... there is no way my data has a similar length.

> Finally, there is also a whole family of sentence embedding models here that might be worth checking out.

Thanks, I am already using a Sentence Transformers model for the extension, as I wrote: `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`.
Could you plot a histogram or something comparable of the token lengths of documents in your dataset? This would help in making decisions for `min_length` and `max_length`. Also, if the majority of the data is short (e.g. less than a paragraph in length) I, unfortunately, don't think DeCLUTR is the most suitable approach.
With that said, modifying `min_length` and `max_length` is reasonable, but there are a couple of things to keep in mind:

- Ideally, `min_length` and `max_length` would be a lower and upper bound on the length of the text you expect to do inference on. That way the model is trained and tested on text of similar length.
- We haven't really experimented with a range of `min_length` and `max_length` values in the paper, so I can only say with confidence that `min_length=32` and `max_length=512` are good defaults. Anything else you would have to experiment with.
Thus, I quickly ran into the "known" token/span length error (see
It is worth noting that this is not an error per se but a limitation of the model. You need enough tokens for the span sampling procedure to make any sense.
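As a toy illustration of that limitation (this is not DeCLUTR's actual sampler, which constrains where positives fall relative to their anchor; here the spans are drawn independently): each anchor needs a matching positive, and each span can be up to `max_length` tokens long, so a document must hold at least `2 * num_anchors * max_length` tokens before sampling can succeed.

```python
import random

def sample_spans(tokens, num_anchors=2, min_length=32, max_length=512):
    """Toy span sampler showing why short documents are rejected."""
    required = 2 * num_anchors * max_length  # one anchor + one positive per anchor
    if len(tokens) < required:
        raise ValueError(
            f"document has {len(tokens)} tokens, "
            f"but 2 * num_anchors * max_length = {required} are required"
        )
    spans = []
    for _ in range(2 * num_anchors):
        length = random.randint(min_length, max_length)
        start = random.randint(0, len(tokens) - length)
        spans.append(tokens[start:start + length])
    return spans
```

With the defaults, a 100-token document raises the error above, while a 2048-token document yields four spans (two anchors, two positives).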
OK, thanks for your insights anyway. I see; obviously I will have to look for something else, maybe... I don't see a useful point in merging/concatenating documents just to meet your "defaults" :-/
What I don't understand: you mention that "Ideally min_length and max_length would be an upper and lower bound on the length of text you expect to do inference on", but you require the documents for training to be at least 4 (!) times that upper bound!? This doesn't really make sense if you would like to fine-tune a language model for a dedicated domain. In this (my) case, I would like to train on exactly the kind of documents I expect to receive for inference (i.e. embedding) later...
We require a multiple of 2 because we always sample at least two spans from each document (an anchor and a positive). The multiple increases when `num_anchors > 1` (I think you are saying 4 because `num_anchors == 2` by default?). This doesn't mean the model actually sees text of this length. It only ever sees text from token length `min_length` up to token length `max_length`. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.
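As a quick sanity check of that arithmetic (a sketch, not the library's own validation code), the required document length scales linearly with `num_anchors`:

```python
def min_train_doc_length(num_anchors=2, max_length=512):
    # Each anchor needs a positive, so 2 spans per anchor,
    # each up to max_length tokens long.
    return 2 * num_anchors * max_length

for n in (1, 2, 3):
    print(n, min_train_doc_length(num_anchors=n))
# With the default max_length=512: 1024, 2048, and 3072 tokens respectively.
```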
Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable. I would also check out the training notebook and the `preprocess_wikitext_103.py` scripts if you have not. They demonstrate the process of calculating `min_length` and then filtering WikiText103 by it to produce a subsetted corpus of 17,824 documents.
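That filtering step can be sketched like this (the real logic lives in `preprocess_wikitext_103.py`; whitespace splitting here is a stand-in for the model's tokenizer):

```python
def filter_by_min_length(documents, num_anchors=2, max_length=512):
    """Keep only documents long enough for span sampling during training."""
    min_doc_length = 2 * num_anchors * max_length  # 2048 tokens with the defaults
    return [doc for doc in documents if len(doc.split()) >= min_doc_length]

corpus = ["far too short", "token " * 3000]
print(len(filter_by_min_length(corpus)))  # 1: only the long document survives
```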
Finally, there is also a whole family of sentence embedding models here that might be worth checking out.
I guess I am in "uncharted lands" with DeCLUTR, then, and should probably look for another method that fits my use case better.
Note: the x-axis shows the length (i.e. number of tokens) and the y-axis the number of documents with that length/number of tokens.
Yes, those are very short training examples. You could try lowering `max_length` accordingly and see what kind of performance you can get. Otherwise, there are some great unsupervised sentence embedding models here that you may be able to train on your data.
Closing, feel free to re-open if you are still having issues.