Comments (8)

JohnGiorgi commented on July 23, 2024

Hi @piegu, you can see #190 for an explanation of why we don't use a validation set during pre-training. When we evaluate the model on SentEval (after pre-training), we do have validation sets for each of the tasks, which we used to guide model development and hyperparameter tuning.

I am really not sure why you are seeing a loss of 0. How many examples are in this file? Also, it might be helpful to see the values of all the variables you are using in the overrides here. Finally, note that, as per #190, we didn't use a validation set to measure the performance of the self-supervised pre-training objectives. Instead, we used average performance across the validation sets of SentEval, so this "feature" has not been tested and I doubt it works as expected.

piegu commented on July 23, 2024

How many examples are in this file?
9840

@JohnGiorgi: even if you think it is not useful here, can you test a validation dataset on your side with your notebook training.ipynb?

I think the contrastive training loss is not being computed for the validation dataset at validation time (end of epoch).

Here is modified code from your notebook training.ipynb:

validation_data_path = "validation.txt"

overrides = (
    f"{{'train_data_path': '{train_data_path}', "
    # lower the batch size to be able to train on Colab GPUs
    "'data_loader.batch_size': 2, "
    # training examples / batch size. Not required, but gives us a more informative progress bar during training
    "'data_loader.batches_per_epoch': 8912, "
    f"'validation_data_path': '{validation_data_path}',}}"
)

!allennlp train "declutr_small.jsonnet" \
    --serialization-dir "$output" \
    --overrides "$overrides" \
    --include-package "declutr" \
    -f

JohnGiorgi commented on July 23, 2024

Can I ask what your plans are for using the model and what you hope to achieve by measuring the contrastive loss on a validation set?

I’m not really sure how much work it would be to support using a validation set to measure contrastive loss performance, but it’s unclear to me how helpful that would be (#190), so I don’t really plan to spend time on it.

piegu commented on July 23, 2024

Hi @JohnGiorgi

Can I ask what your plans are for using the model (...)?

I am participating in an academic project aimed at document similarity. For this, we have built a pipeline that starts by retrieving the embeddings of all the sentences of a document (via a trained DeCLUTR, for example), then clusters those sentences to build a document vector, which is used to compute the cosine similarity with the vectors of all other documents.
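To make that concrete, here is a minimal sketch of the last two steps, assuming the sentence embeddings are already available as NumPy arrays (the clustering and pooling choices are illustrative only, not our exact pipeline):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def document_vector(sentence_embeddings, n_clusters=5):
    # Cluster the document's sentence embeddings and average the cluster
    # centroids to obtain a single document vector (illustrative pooling).
    n_clusters = min(n_clusters, len(sentence_embeddings))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(sentence_embeddings)
    return kmeans.cluster_centers_.mean(axis=0)

# Random placeholder embeddings for three documents, just to show the flow.
rng = np.random.default_rng(0)
doc_embeddings = [rng.normal(size=(20, 768)) for _ in range(3)]
doc_vectors = np.stack([document_vector(e) for e in doc_embeddings])
similarities = cosine_similarity(doc_vectors)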

When I read your paper and its claim that a trained DeCLUTR can outperform other well-known embedding encoders, I started using your notebook training.ipynb to train it on our data.

(...) and what you hope to achieve by measuring the contrastive loss on a validation set?

Well, if you train a model on a dataset without verifying at each checkpoint how well training is going, you do not know whether you need to train for more epochs, with another learning rate, etc. Even with a contrastive loss, you can check whether your validation loss continues to decrease (in order to avoid overfitting).

Unfortunately, your notebook training.ipynb does not allow that, as the contrastive loss is not computed when evaluating on a validation dataset. As a matter of fact, here are some lines printed by the allennlp train script when using a validation dataset:

(...)
batch_loss: 0.0000, loss: 0.0000 ||: 100%|##########| 989/989 [18:38<00:00,  1.13s/it]
2022-03-22 09:27:09,791 - INFO - allennlp.training.tensorboard_writer -                        Training |  Validation
2022-03-22 09:27:09,805 - INFO - allennlp.training.tensorboard_writer - loss               |     2.057  |     0.000
2022-03-22 09:27:12,087 - INFO - allennlp.training.checkpointer - Best validation performance so far. Copying weights to 'output/best.th'.
(...)

As you can see, validation loss = 0.00, which is wrong.

You can test it through the modified version of your notebook that I put in Colab: training_with_validation_dataset.ipynb

I don’t really plan to spend time on it

If you change your plan, I'll be happy to help.

JohnGiorgi commented on July 23, 2024

I am participating in an academic project aimed at document similarity.

Cool! Any reason you want to pre-train our model instead of just using the pre-trained models as is? We have pre-trained models for both general-domain and scientific text. There are also many pre-trained sentence embedding methods that have been proposed before and after DeCLUTR (see https://www.sbert.net/) that you could use "off-the-shelf" (unless your text comes from a unique domain not covered by these models' pre-training) or that you could fine-tune on some labelled data (if you have it).
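For reference, a minimal sketch of using one of the pre-trained DeCLUTR checkpoints off-the-shelf with Hugging Face transformers, following the usage example in the repo (mean pooling over the last hidden states; swap in whichever checkpoint you need):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")

texts = ["A sentence to embed.", "Another sentence to embed."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    sequence_output = model(**inputs).last_hidden_state

# Mean-pool over tokens, ignoring padding, to get one embedding per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (sequence_output * mask).sum(dim=1) / mask.sum(dim=1)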

Well, if you train a model with a dataset without verifying for each checkpoint how well goes the training, you do not know if you need to train with more epochs, with another learning rate

I think there's still a major misunderstanding here. Without repeating myself I would encourage you to look at popular self-supervised literature (like BERT or SimCLR) and notice that they don't use validation sets to measure the performance of the self-supervised objectives either (as far as I can tell), so I don't think our choice to do the same is "strange". You would be much better off using some downstream task(s) that you care about to tune the hyperparameters of the pre-training stage. Or, maybe better yet, fine-tune the whole pre-trained encoder with some labelled data (if you have it). See https://www.sbert.net/docs/training/overview.html, which has instructions for fine-tuning sentence encoders on your own data. You can load DeCLUTR's weights into this library easily.
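As an illustration of that last option, a minimal sketch of fine-tuning a sentence encoder on labelled pairs with sentence-transformers (the checkpoint name and example pairs are placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Load a Hugging Face checkpoint (e.g. a DeCLUTR model) as a sentence encoder.
model = SentenceTransformer("johngiorgi/declutr-small")

# Placeholder labelled pairs: (sentence_a, sentence_b, similarity in [0, 1]).
train_examples = [
    InputExample(texts=["First sentence.", "A similar sentence."], label=0.9),
    InputExample(texts=["First sentence.", "An unrelated sentence."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)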

If you change your plan, I'll be happy to help.

Feel free to make a PR or fork the repo, if this is really important to you!

piegu commented on July 23, 2024

Hi @JohnGiorgi

Any reason you want to pre-train our model instead of just using the pre-trained models as is?

Do you have pre-trained DeCLUTR models in Portuguese for the Brazilian legal domain (one model) and also for the Brazilian health domain (a second model)?

There are also many pre-trained sentence embedding methods that have been proposed before and after DeCLUTR (see https://www.sbert.net/)

Of course, I could use embedding encoder models other than DeCLUTR, but since your paper shows that DeCLUTR outperforms the others (and is "easy" to train with unlabeled data), I wanted to use it instead.

(...) or that you could fine-tune on some labelled data (if you have it)

Same thought about language and specific domain (see above).

Without repeating myself I would encourage you to look at popular self-supervised literature (like BERT or SimCLR) and notice that they don't use validation sets to measure the performance of the self-supervised objectives either (as far as I can tell)

Can we open the discussion about both the "without validation dataset" and "with validation dataset" cases?

  • without validation dataset: Jacob Devlin from Google explained about BERT training in 2018 (post) that "The best way to know when to stop pre-training is to take intermediate checkpoints and fine-tune them for a downstream task, and see when that stops helping (by more than some trivial amount)." This makes sense: the downstream task plays the role of the validation dataset, checking at the end of each epoch how well the model is doing. Thanks to that, you can decide whether to train for more epochs and test the best hyperparameter configuration (LR, etc.). (Note: I think we agree on that, as you wrote in your previous message "You would be much better off using some downstream task(s) that you care about to tune the hyperparameters of the pre-training stage.", but not in your paper, which gives epoch and LR values without indications about how to test them.)
  • with validation dataset: you can check, for example, the Hugging Face notebooks, and the notebook language_modeling_from_scratch.ipynb in particular. You will see that using a validation dataset for BERT training on the language modeling task is a good way to check how well the model is doing (validation loss), as sketched below.
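For concreteness, the pattern used in that notebook looks roughly like this (checkpoint name and datasets are placeholders; the point is that passing an eval_dataset to the Trainer reports a validation loss at the end of each epoch):

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="mlm-output",
    evaluation_strategy="epoch",  # report the validation loss at the end of each epoch
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
    train_dataset=tokenized_train_dataset,       # placeholder: a tokenized datasets.Dataset
    eval_dataset=tokenized_validation_dataset,   # placeholder: a tokenized datasets.Dataset
)
trainer.train()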

JohnGiorgi commented on July 23, 2024

Do you have pre-trained DeCLUTR models in Portuguese for the Brazilian legal domain (one model) and also for the Brazilian health domain (a second model)?

Gotcha. Sounds like you do need to train from scratch (or possibly check out some of the work on language agnostic sentence embeddings).

Jacob Devlin from Google explained about BERT training in 2018 (google-research/bert#95 (comment)) that "The best way...

Yup, I am simply echoing Devlin's advice here. Do you have a downstream task that measures what you care about and could be used for validation, testing, and hyperparam tuning?

but not in your paper, which gives epoch and LR values without indications about how to test them.

Hmm, that's not true. Section 4.1, under Training, says: "Hyperparameters were tuned on the SentEval validation sets." Is there confusion about how we did that?

More generally, why not try starting with the default hyperparameters, which worked well for us across SentEval's 18 downstream tasks and 10 probing tasks? I know your domains are quite different, but I don't think there's reason to suspect these hyperparameters would perform very poorly.

JohnGiorgi commented on July 23, 2024

Closing this! @piegu please feel free to re-open if you still have questions.
