Comments (5)

craffel commented on September 26, 2024

We don't monitor the validation perplexity. C4 is so big that none of our models could possibly overfit to it. For all of the experiments in the paper except those in Table 14, we don't even come close to making a single pass over the dataset; for the experiments in Table 14, we pre-train for 1T tokens, which does end up being roughly a single pass, depending on the mixing rate. But overall, there's no substantive difference between train and validation perplexity for any of the experiments (except those in Figure 6/Table 9, where we artificially limit the size of C4, but that's the only case where we actually show the training loss).

For Table 14, the train perplexities are not comparable across models because they use different mixing rates (corresponding to a different artificial dataset size for each model). As a result, some models see more or less unlabeled data, and, as I mentioned, the perplexity for the supervised tasks can be tiny due to the effectively limited set of reasonable tokens for classification tasks. I don't think you would learn anything from looking at those perplexities.
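
To make the mixing-rate point concrete, here is a tiny sketch (my own illustration with made-up loss values, not numbers from the paper) of how identical per-task quality still yields different mixture-level perplexities under different mixing rates:

```python
import math

# Hypothetical per-token negative log-likelihoods (nats) for the two kinds of targets;
# the values are invented purely to illustrate the effect of the mixing rate.
nll_unsupervised = 2.5    # span-corruption targets: large effective vocabulary
nll_classification = 0.2  # classification targets: only a couple of plausible tokens

def mixture_perplexity(unsup_rate):
    """Perplexity of the training mixture for a given unsupervised mixing rate."""
    avg_nll = unsup_rate * nll_unsupervised + (1 - unsup_rate) * nll_classification
    return math.exp(avg_nll)

# Same per-task quality, different mixing rates -> very different "train perplexity".
print(round(mixture_perplexity(0.9), 1))  # ~9.7
print(round(mixture_perplexity(0.5), 1))  # ~3.9
```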

For Table 13, here is a plot of the training losses (monotonically related to the perplexity) for the variants in the table (not including ensembles, since ensembling is done post-hoc):

[plot: pre-training loss curves for the Table 13 variants]

As a key, sc-bi_v1 is the baseline trained 4x as long, sc-bi_v1-bsx4 is the baseline trained with a 4x bigger batch size, sc-bi_v1-2x is the 2x larger model trained 2x as long, and sc-bi_v1-4x is the 4x bigger model trained for the same number of steps as the baseline.
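
For reference, here is a back-of-the-envelope sketch (my reading of the variant names above, not code from the repo) showing that each variant spends roughly 4x the baseline pre-training compute, just allocated differently:

```python
# (model-size multiplier, training-steps multiplier, batch-size multiplier) vs. the baseline
variants = {
    "sc-bi_v1      (baseline, 4x steps)":  (1, 4, 1),
    "sc-bi_v1-bsx4 (4x batch size)":       (1, 1, 4),
    "sc-bi_v1-2x   (2x size, 2x steps)":   (2, 2, 1),
    "sc-bi_v1-4x   (4x size, same steps)": (4, 1, 1),
}

for name, (size, steps, batch) in variants.items():
    # Ignoring constant overheads, cost scales with model size x tokens processed.
    print(f"{name}: ~{size * steps * batch}x baseline compute")
```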

craffel commented on September 26, 2024

Thanks for your interest!

craffel commented on September 26, 2024

Thanks!

Do you want the train perplexity or validation perplexity? And do you want it for pre-training or fine-tuning?

I'm curious why you want this. For Table 14, the perplexity is averaged over both the unsupervised task and the supervised tasks, which are on a totally different scale, since for some of the supervised tasks (classification) it's trivial for the model to reduce the perplexity to small values (given the task prefix, only ~two tokens are really valid).
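
As a rough worked example of that last point (my own toy numbers, not figures from the paper): even a model that just guesses uniformly between the two label strings already drives the per-token perplexity of a classification task down to about 2, whereas open-ended span-corruption targets sit far higher, so averaging the two gives a number that is hard to interpret:

```python
import math

# Toy illustration: with a classification task prefix, essentially only the ~two
# label tokens are plausible continuations, so even a uniform guess over them
# gives a per-token perplexity of 2.
p_label = 0.5
print(math.exp(-math.log(p_label)))  # 2.0

# By contrast, a uniform guess over an open vocabulary of, say, 32,000 subwords
# (a hypothetical size, for illustration) has a perplexity of 32,000.
p_token = 1 / 32000
print(math.exp(-math.log(p_token)))  # ~32000
```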

ngoyal2707 commented on September 26, 2024

@craffel Thanks for the quick reply.
Ideally both the train_ppl and valid_ppl of pre-training. Agreed that the fine-tuning ppl is not that interesting.

In terms of how this can be helpful: since ppl is a decent measure of model quality during pre-training, I am curious whether (4x training, 1x model) achieves lower ppl than (1x training, 4x model), etc. I am also curious whether the train/valid ppl of pre-training correlates strongly with downstream task performance.
Similarly for big models, I am curious how much better T5-11B does on the pre-training task compared to, say, T5-Large.

Let me know if the above makes sense.
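
If it helps, here is a minimal sketch (my own, nothing from the repo) of how one could check the correlation question once the pre-training perplexities and downstream scores for each variant have been collected:

```python
from scipy import stats

def ppl_vs_downstream(pretrain_ppl, downstream_score):
    """Spearman rank correlation between pre-training perplexity and a downstream metric.

    A strongly negative correlation would suggest that lower pre-training
    perplexity tracks better downstream performance across the variants.
    """
    rho, pval = stats.spearmanr(pretrain_ppl, downstream_score)
    return rho, pval
```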

ngoyal2707 commented on September 26, 2024

This is very helpful, thanks for sharing!
