
Comments (7)

repodiac commented on July 4, 2024

OK, but you agree that you require the documents used for training to be longer (with 2 anchors, at least 4 x max_length) than what you actually support at inference! This might be a serious issue for practical use, at least in my case.

> (I think you are saying 4 because num_anchors == 2 by default?)

Yes.

> This doesn't mean the model actually sees text of this length. It only ever sees text from token length min_length up to token length max_length. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.

I have only skimmed the paper, to be honest. :)

> Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable.

Ok, will do.

> I would also check out the training notebook and the preprocess_wikitext_103.py scripts if you have not. They demonstrate the process of calculating min_length and then filtering WikiText103 by it to produce a subsetted corpus of 17,824 documents.

I have analyzed exactly that script, to see how much preprocessing is required (not much, fortunately). WikiText documents are "huge"... there is no way my documents are of similar length.

> Finally, there is also a whole family of sentence embedding models here that might be worth checking out.

Thanks, I am already using a Sentence Transformers model for extension, as I wrote: sentence-transformers/paraphrase-multilingual-mpnet-base-v2

from declutr.

JohnGiorgi commented on July 4, 2024

Could you plot a histogram or something comparable of the token lengths of documents in your dataset? This would help in making decisions for min_length and max_length. Also, if the majority of data is short (e.g. less than a paragraph in length) I, unfortunately, don't think DeCLUTR is the most suitable approach.
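A minimal sketch of that check, using whitespace splitting as a rough stand-in for the model's subword tokenizer (counts from the real tokenizer, e.g. a Hugging Face one, will differ):

```python
def token_lengths(documents, tokenize=str.split):
    # Whitespace splitting is only a proxy; swap in the model's
    # subword tokenizer for accurate counts.
    return [len(tokenize(doc)) for doc in documents]

# Toy corpus standing in for the real dataset.
docs = [
    "a short document",
    "a somewhat longer document with quite a few more tokens in it",
]
lengths = token_lengths(docs)
print(min(lengths), max(lengths))  # the range min_length/max_length should cover
```

Feeding `lengths` to any histogram routine (e.g. matplotlib's `plt.hist`) then shows at a glance how much of the corpus clears a given min_length.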

With that said, modifying min_length and max_length is reasonable but there are a couple of things to keep in mind.

  1. Ideally min_length and max_length would be an upper and lower bound on the length of text you expect to do inference on. That way the model is trained and tested on text of similar length.
  2. We haven't really experimented with a range of min_length and max_length in the paper, so I can only say with confidence that min_length=32 and max_length=512 are good defaults. Anything else you would have to experiment with.

> Thus, I quickly ran into the "known" token/span length error (see

It is worth noting that this is not an error per se but a limitation of the model: you need enough tokens for the span sampling procedure to make any sense.
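To see why, here is an illustrative span sampler (not DeCLUTR's actual implementation), assuming each span is a contiguous run of between min_length and max_length tokens:

```python
import random

def sample_span(tokens, min_length, max_length):
    # A document shorter than max_length cannot guarantee a full-size
    # span -- this is the "token/span length" failure discussed above.
    if len(tokens) < max_length:
        raise ValueError(
            f"document has {len(tokens)} tokens, need at least {max_length}"
        )
    span_len = random.randint(min_length, max_length)
    start = random.randint(0, len(tokens) - span_len)
    return tokens[start : start + span_len]
```

With multiple anchors and positives sampled per document, the required document length grows accordingly.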


repodiac commented on July 4, 2024

OK, thanks for your insights anyway. I see; obviously I will have to look for something else... I don't see the point in merging/concatenating documents just to meet your "defaults". :-/

What I don't understand: you mention that "Ideally min_length and max_length would be an upper and lower bound on the length of text you expect to do inference on", but you require the training documents to be at least four times (!) that upper bound. This doesn't really make sense if you want to fine-tune a language model for a dedicated domain. In my case I would like to train on exactly the kind of documents I expect to receive for inference (i.e. embedding) later...


JohnGiorgi commented on July 4, 2024

We require a multiple of 2 because we always sample at least two spans from each document (an anchor and a positive). The multiple increases when num_anchors > 1 (I think you are saying 4 because num_anchors == 2 by default?). This doesn't mean the model actually sees text of this length. It only ever sees text from token length min_length up to token length max_length. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.
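In other words (a back-of-the-envelope sketch; `spans_per_anchor = 2` counts the anchor plus its positive):

```python
def min_doc_length(max_length, num_anchors=2, spans_per_anchor=2):
    # Each anchor needs room for itself plus one positive span, each
    # up to max_length tokens, so a document must fit all of them.
    return num_anchors * spans_per_anchor * max_length

print(min_doc_length(max_length=512))                 # 2048 with the defaults
print(min_doc_length(max_length=512, num_anchors=1))  # 1024
```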

Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable. I would also check out the training notebook and the preprocess_wikitext_103.py scripts if you have not. They demonstrate the process of calculating min_length and then filtering WikiText103 by it to produce a subsetted corpus of 17,824 documents.
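The filtering step those scripts perform can be sketched like this (a simplification of the actual script; whitespace splitting again stands in for the real tokenizer):

```python
def filter_corpus(documents, min_length, tokenize=str.split):
    # Keep only documents long enough for span sampling.
    return [doc for doc in documents if len(tokenize(doc)) >= min_length]

corpus = ["too short", "this document is comfortably longer than the cutoff"]
print(len(filter_corpus(corpus, min_length=5)))  # 1
```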

Finally, there is also a whole family of sentence embedding models here that might be worth checking out.


repodiac commented on July 4, 2024

Just FYI:

[histogram of the token lengths of my documents]

I guess, I am on "uncharted lands" then with DeCLUTR and should probably look for another method which fits better to my use case.

Note: the x-axis shows the length (i.e. number of tokens) and the y-axis the number of documents with that length.


JohnGiorgi commented on July 4, 2024

Yes, those are very short training examples. You could try lowering the max_length accordingly and see what kind of performance you can get. Otherwise, there are some great unsupervised sentence embedding models here that you may be able to train on your data.


JohnGiorgi commented on July 4, 2024

Closing, feel free to re-open if you are still having issues.

