Comments (5)

craffel commented on September 26, 2024

We don't monitor the validation perplexity. C4 is so big that none of our models could possibly overfit to it. For all of the experiments in the paper except those in Table 14, we don't even come close to making a single pass over the dataset; for the experiments in Table 14, we pre-train for 1T tokens, which does end up being roughly a single pass, depending on the mixing rate. But overall, there's no substantive difference between train and validation perplexity for any of the experiments (except those in Figure 6/Table 9, where we artificially limit the size of C4, but that's the only case where we actually show the training loss).

For Table 14, the train perplexities are not comparable across models because they use different mixing rates (corresponding to a different artificial dataset size for each model). As a result, some models see more or less unlabeled data, and, as I mentioned, the perplexity for the supervised tasks can be tiny due to the effectively limited set of reasonable tokens for classification tasks. I don't think you would learn anything from looking at those perplexities.
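
To make the mixing-rate point concrete, here is a tiny sketch (my own illustration with made-up loss values, not numbers from the paper) of how identical per-task quality still yields different mixture-level perplexities under different mixing rates:

```python
import math

# Hypothetical per-token negative log-likelihoods (nats) for the two kinds of targets;
# the values are invented purely to illustrate the effect of the mixing rate.
nll_unsupervised = 2.5    # span-corruption targets: large effective vocabulary
nll_classification = 0.2  # classification targets: only a couple of plausible tokens

def mixture_perplexity(unsup_rate):
    """Perplexity of the training mixture for a given unsupervised mixing rate."""
    avg_nll = unsup_rate * nll_unsupervised + (1 - unsup_rate) * nll_classification
    return math.exp(avg_nll)

# Same per-task quality, different mixing rates -> very different "train perplexity".
print(round(mixture_perplexity(0.9), 1))  # ~9.7
print(round(mixture_perplexity(0.5), 1))  # ~3.9
```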

For Table 13, here is a plot of the training losses (monotonically related to the perplexity) for the variants in the table (not including ensembles, since ensembling is done post-hoc):

[plot: pre-training loss curves for the Table 13 variants]

As a key, sc-bi_v1 is the baseline trained 4x as long, sc-bi_v1-bsx4 is the baseline trained with a 4x bigger batch size, sc-bi_v1-2x is the 2x larger model trained 2x as long, and sc-bi_v1-4x is the 4x bigger model trained for the same number of steps as the baseline.
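
For reference, here is a back-of-the-envelope sketch (my reading of the variant names above, not code from the repo) showing that each variant spends roughly 4x the baseline pre-training compute, just allocated differently:

```python
# (model-size multiplier, training-steps multiplier, batch-size multiplier) vs. the baseline
variants = {
    "sc-bi_v1      (baseline, 4x steps)":  (1, 4, 1),
    "sc-bi_v1-bsx4 (4x batch size)":       (1, 1, 4),
    "sc-bi_v1-2x   (2x size, 2x steps)":   (2, 2, 1),
    "sc-bi_v1-4x   (4x size, same steps)": (4, 1, 1),
}

for name, (size, steps, batch) in variants.items():
    # Ignoring constant overheads, cost scales with model size x tokens processed.
    print(f"{name}: ~{size * steps * batch}x baseline compute")
```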

craffel commented on September 26, 2024

Thanks for your interest!

craffel commented on September 26, 2024

Thanks!

Do you want the train perplexity or validation perplexity? And do you want it for pre-training or fine-tuning?

I'm curious why you want this. For Table 14, the perplexity is averaged over both the unsupervised task and the supervised tasks, which are on a totally different scale, since for some of the supervised tasks (classification) it's trivial for the model to reduce the perplexity to small values (given the task prefix, only ~two tokens are really valid).
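
As a rough worked example of that last point (my own toy numbers, not figures from the paper): even a model that just guesses uniformly between the two label strings already drives the per-token perplexity of a classification task down to about 2, whereas open-ended span-corruption targets sit far higher, so averaging the two gives a number that is hard to interpret:

```python
import math

# Toy illustration: with a classification task prefix, essentially only the ~two
# label tokens are plausible continuations, so even a uniform guess over them
# gives a per-token perplexity of 2.
p_label = 0.5
print(math.exp(-math.log(p_label)))  # 2.0

# By contrast, a uniform guess over an open vocabulary of, say, 32,000 subwords
# (a hypothetical size, for illustration) has a perplexity of 32,000.
p_token = 1 / 32000
print(math.exp(-math.log(p_token)))  # ~32000
```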

ngoyal2707 commented on September 26, 2024

@craffel Thanks for the quick reply.
Ideally both the train_ppl and valid_ppl of pre-training. Agreed that the fine-tuning ppl is not that interesting.

In terms of how this can be helpful: since ppl is a decent measure of model quality during pre-training, I am curious whether (4x training, 1x model) achieves lower ppl than (1x training, 4x model), etc. I am also curious whether the train/valid ppl of pre-training correlates strongly with downstream task performance.
Similarly for big models, I am curious how much better T5-11B does on the pre-training task compared to, say, T5-Large.

Let me know if the above makes sense.
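
If it helps, here is a minimal sketch (my own, nothing from the repo) of how one could check the correlation question once the pre-training perplexities and downstream scores for each variant have been collected:

```python
from scipy import stats

def ppl_vs_downstream(pretrain_ppl, downstream_score):
    """Spearman rank correlation between pre-training perplexity and a downstream metric.

    A strongly negative correlation would suggest that lower pre-training
    perplexity tracks better downstream performance across the variants.
    """
    rho, pval = stats.spearmanr(pretrain_ppl, downstream_score)
    return rho, pval
```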

ngoyal2707 commented on September 26, 2024

This is very helpful, thanks for sharing!
