
Comments (7)

repodiac commented on July 4, 2024

OK, but you agree that you require the documents used for training to be longer (with 2 anchors, at least 4 x max_length) than what you actually support at inference! This might be a serious issue for practical use, at least in my case.

> (I think you are saying 4 because num_anchors == 2 by default?)

Yes.

> This doesn't mean the model actually sees text of this length. It only ever sees text from token length min_length up to token length max_length. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.

I have only skimmed the paper, to be honest. :)

> Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable.

Ok, will do.

> I would also check out the training notebook and the preprocess_wikitext_103.py scripts if you have not. They demonstrate the process of calculating min_length and then filtering WikiText103 by it to produce a subsetted corpus of 17,824 documents.

I have analyzed exactly that script, to see how much preprocessing is required (not much, fortunately). WikiText documents are "huge"... there is no way my documents are of similar length.

> Finally, there is also a whole family of sentence embedding models here that might be worth checking out.

Thanks, I am already using a Sentence Transformers model for extension, as I wrote: sentence-transformers/paraphrase-multilingual-mpnet-base-v2

from declutr.

JohnGiorgi commented on July 4, 2024

Could you plot a histogram or something comparable of the token lengths of documents in your dataset? This would help in making decisions for min_length and max_length. Also, if the majority of data is short (e.g. less than a paragraph in length) I, unfortunately, don't think DeCLUTR is the most suitable approach.
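A minimal sketch of that check, using whitespace splitting as a rough stand-in for the model's subword tokenizer (counts from the real tokenizer, e.g. a Hugging Face one, will differ):

```python
def token_lengths(documents, tokenize=str.split):
    # Whitespace splitting is only a proxy; swap in the model's
    # subword tokenizer for accurate counts.
    return [len(tokenize(doc)) for doc in documents]

# Toy corpus standing in for the real dataset.
docs = [
    "a short document",
    "a somewhat longer document with quite a few more tokens in it",
]
lengths = token_lengths(docs)
print(min(lengths), max(lengths))  # the range min_length/max_length should cover
```

Feeding `lengths` to any histogram routine (e.g. matplotlib's `plt.hist`) then shows at a glance how much of the corpus clears a given min_length.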

With that said, modifying min_length and max_length is reasonable but there are a couple of things to keep in mind.

  1. Ideally min_length and max_length would be an upper and lower bound on the length of text you expect to do inference on. That way the model is trained and tested on text of similar length.
  2. We haven't really experimented with a range of min_length and max_length in the paper, so I can only say with confidence that min_length=32 and max_length=512 are good defaults. Anything else you would have to experiment with.

> Thus, I quickly ran into the "known" token/span length error (see

It is worth noting that this is not an error per se but a limitation of the model: you need enough tokens for the span sampling procedure to make any sense.
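To see why, here is an illustrative span sampler (not DeCLUTR's actual implementation), assuming each span is a contiguous run of between min_length and max_length tokens:

```python
import random

def sample_span(tokens, min_length, max_length):
    # A document shorter than max_length cannot guarantee a full-size
    # span -- this is the "token/span length" failure discussed above.
    if len(tokens) < max_length:
        raise ValueError(
            f"document has {len(tokens)} tokens, need at least {max_length}"
        )
    span_len = random.randint(min_length, max_length)
    start = random.randint(0, len(tokens) - span_len)
    return tokens[start : start + span_len]
```

With multiple anchors and positives sampled per document, the required document length grows accordingly.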


repodiac commented on July 4, 2024

OK, thanks for your insights anyway. I see; obviously I will have to look for something else... I don't see the point in merging/concatenating documents just to meet your "defaults". :-/

What I don't understand: you mention that "Ideally min_length and max_length would be an upper and lower bound on the length of text you expect to do inference on", but you require the training documents to be at least four times (!) that upper bound. This doesn't really make sense if you want to fine-tune a language model for a dedicated domain. In my case I would like to train on exactly the kind of documents I expect to receive for inference (i.e. embedding) later...


JohnGiorgi commented on July 4, 2024

We require a multiple of 2 because we always sample at least two spans from each document (an anchor and a positive). The multiple increases when num_anchors > 1 (I think you are saying 4 because num_anchors == 2 by default?). This doesn't mean the model actually sees text of this length. It only ever sees text from token length min_length up to token length max_length. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.
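In other words (a back-of-the-envelope sketch; `spans_per_anchor = 2` counts the anchor plus its positive):

```python
def min_doc_length(max_length, num_anchors=2, spans_per_anchor=2):
    # Each anchor needs room for itself plus one positive span, each
    # up to max_length tokens, so a document must fit all of them.
    return num_anchors * spans_per_anchor * max_length

print(min_doc_length(max_length=512))                 # 2048 with the defaults
print(min_doc_length(max_length=512, num_anchors=1))  # 1024
```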

Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable. I would also check out the training notebook and the preprocess_wikitext_103.py scripts if you have not. They demonstrate the process of calculating min_length and then filtering WikiText103 by it to produce a subsetted corpus of 17,824 documents.
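The filtering step those scripts perform can be sketched like this (a simplification of the actual script; whitespace splitting again stands in for the real tokenizer):

```python
def filter_corpus(documents, min_length, tokenize=str.split):
    # Keep only documents long enough for span sampling.
    return [doc for doc in documents if len(tokenize(doc)) >= min_length]

corpus = ["too short", "this document is comfortably longer than the cutoff"]
print(len(filter_corpus(corpus, min_length=5)))  # 1
```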

Finally, there is also a whole family of sentence embedding models here that might be worth checking out.


repodiac commented on July 4, 2024

Just FYI:

[histogram of the token lengths of my documents]

I guess, I am on "uncharted lands" then with DeCLUTR and should probably look for another method which fits better to my use case.

Note: the x-axis shows the length (i.e. number of tokens) and the y-axis the number of documents with that length.


JohnGiorgi commented on July 4, 2024

Yes, those are very short training examples. You could try lowering the max_length accordingly and see what kind of performance you can get. Otherwise, there are some great unsupervised sentence embedding models here that you may be able to train on your data.


JohnGiorgi commented on July 4, 2024

Closing, feel free to re-open if you are still having issues.

