
Comments (5)

sebastianruder avatar sebastianruder commented on July 30, 2024

Hi,

Have you been fine-tuning XLM-R, or which model have you been fine-tuning, to achieve this comparatively low performance on PAWS-X?

We are not currently planning to release fine-tuned models, as this would mean releasing one model per task. Even though fine-tuned models may be helpful to some extent for certain downstream tasks (see, for instance, this recent paper), we believe that the original pre-trained models should generally be used as the starting point for further experiments.

from xtreme.

atreyasha avatar atreyasha commented on July 30, 2024

Hi @sebastianruder, thank you for the quick response.

Fine-tuning instances

As I just started fine-tuning recently, I have only managed to test a few instances. Essentially, I set up this repo (as per the readme) and ran the following commands with the corresponding best (train/dev/test) results:

  1. bash scripts/train.sh bert-base-multilingual-cased pawsx

eval_results

acc_best = 0.9374687343671836
num_best = 1999
correct_best = 1874
acc_MaxLen128 = 0.9349674837418709
num_MaxLen128 = 1999
correct_MaxLen128 = 1869
...

eval_test_results

...
======= Predict using the model from checkpoint-400:
de=0.5757878939469735
en=0.5462731365682841
es=0.6068034017008505
fr=0.5927963981990996
ja=0.688344172086043
ko=0.742871435717859
zh=0.5957978989494748
total=0.6212391910240834
...
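As a quick consistency check on the numbers above (a sketch; the values are copied from the eval_results and eval_test_results snippets), the reported accuracy matches correct/num, and the total appears to be the unweighted mean of the per-language accuracies:

```python
# Sanity-check the reported numbers: acc = correct / num, and the
# "total" appears to be the unweighted mean of the per-language scores.
correct_best, num_best = 1874, 1999
acc_best = correct_best / num_best  # ~0.93747, matches the reported acc_best

per_lang = {
    "de": 0.5757878939469735, "en": 0.5462731365682841,
    "es": 0.6068034017008505, "fr": 0.5927963981990996,
    "ja": 0.688344172086043, "ko": 0.742871435717859,
    "zh": 0.5957978989494748,
}
total = sum(per_lang.values()) / len(per_lang)  # ~0.62124, matches "total"
```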
  2. bash scripts/train.sh xlm-roberta-base pawsx

eval_results

acc_best = 0.9374687343671836
num_best = 1999
correct_best = 1874
acc_MaxLen128 = 0.9309654827413707
num_MaxLen128 = 1999
correct_MaxLen128 = 1861
...

eval_test_results

...
======= Predict using the model from checkpoint-2800:
de=0.575287643821911
en=0.545272636318159
es=0.5732866433216608
fr=0.5737868934467234
ja=0.6588294147073537
ko=0.7218609304652326
zh=0.6068034017008505
total=0.6078753662545558
...
  3. bash scripts/train.sh xlm-roberta-large pawsx

    I cut this training short because the training and dev losses were not changing, and the test accuracy stayed fixed at 1.0, which seems off. Not sure why this was so.

Do you think these results were due to an "unlucky" initial configuration, and that I should re-run these commands a few more times? I believe there is still some stochasticity here in the optimization, despite the model being loaded from a fixed initial checkpoint.
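On the stochasticity point: even with a fixed pre-trained checkpoint, runs can differ through seed-dependent components such as the classifier-head initialization, dropout masks, and data shuffling. A minimal stdlib-only sketch of the shuffling part (the `batch_order` helper is illustrative, not from this repo):

```python
import random

examples = list(range(8))  # stand-in for a handful of training examples

def batch_order(seed):
    """Return the seed-dependent shuffle order of the training data
    (illustrative helper, not part of the xtreme scripts)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled

# Same seed -> identical batch order on every run; a different seed
# generally yields a different order (and hence a different trajectory).
assert batch_order(42) == batch_order(42)
```

If the underlying training script exposes a seed argument, repeating a run with a few different seeds would show how much of the gap is run-to-run variance.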

Fine-tuning models

We are not currently planning to release fine-tuned models as this would mean we would need to release one model for each task.

I understand. Hmm, would it be possible to share the hyperparameters and training arguments used to fine-tune the best-performing model so far for PAWS-X (e.g. those listed on the leaderboard)? I could then try to reproduce the best model with those values.


atreyasha avatar atreyasha commented on July 30, 2024

Just to add more information for this issue:

I ran bash scripts/train.sh xlm-roberta-large pawsx for XLM-R (large) again, but this time reduced the default learning rate from LR=2e-5 to LR=1e-6 in train_pawsx.sh (in contrast to the learning rate of 3e-5 mentioned in Appendix B of the XTREME paper). Here is a snippet of the current results in eval_test_results.

======= Predict using the model from checkpoint-200:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0

======= Predict using the model from checkpoint-400:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0

======= Predict using the model from checkpoint-600:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0

======= Predict using the model from checkpoint-800:
de=0.870935467733867
en=0.6323161580790395
es=0.7293646823411706
fr=0.840920460230115
ja=0.9354677338669335
ko=0.9459729864932466
zh=0.8099049524762382
total=0.8235546344600871

======= Predict using the model from checkpoint-1000:
de=0.6728364182091046
en=0.39419709854927465
es=0.542271135567784
fr=0.6188094047023511
ja=0.8344172086043021
ko=0.8659329664832416
zh=0.6953476738369184
total=0.6605445579932824

Observations and questions

Checkpoint-800 achieves the highest test accuracy, which is in the ballpark of the current highest score on the XTREME leaderboard for sentence classification. However, the model at checkpoint-800 would not have been selected as the best model, since the dev accuracies keep rising after that point.

Is there a reason why the first few checkpoints (above) all show an accuracy of 1.0? Strangely, in all the checkpoints where the test accuracy was 1.0, the dev accuracies were fixed at 0.568784392196098. I am not sure if this is an error or if the test sets really were classified completely correctly.


JunjieHu avatar JunjieHu commented on July 30, 2024

Hi @atreyasha, the test score is around 0.5 because we remove the true class label for each example in the test set and use a fake placeholder ("0") as the label for all test examples. This is intentional: we want to encourage users to generate their predictions and submit them to our XTREME benchmark, and removing the labels avoids trivial submissions. At the beginning of training, the model is not yet well trained and predicts zero for all examples; that is why you observe a test accuracy of 1 at the start.
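The effect can be reproduced in a few lines: against placeholder labels of all zeros, any model that collapses to predicting class 0 everywhere scores a perfect "accuracy" (a sketch, not the benchmark's actual evaluation code):

```python
def accuracy(preds, labels):
    """Fraction of positions where prediction equals label."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

n = 1999                  # size of a PAWS-X test split
fake_labels = [0] * n     # placeholder labels released with the test set
early_preds = [0] * n     # an untrained model collapsing to class 0
print(accuracy(early_preds, fake_labels))  # 1.0 against the placeholders

# Once the model starts predicting class 1 for some pairs, its score
# against the all-zero placeholders drops toward the predicted-0 fraction.
later_preds = [i % 2 for i in range(n)]
print(accuracy(later_preds, fake_labels))  # ~0.5
```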

If you want to select the best system for PAWS-X, you should refer to the dev set accuracy, which is computed against real labels; we also find that the scores on the dev and test sets are well correlated.
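Following that advice, checkpoint selection should be driven by the dev scores. A hypothetical sketch (the accuracy values below are illustrative, not from the runs in this thread):

```python
# Pick the checkpoint with the highest dev accuracy (hypothetical numbers).
dev_acc = {
    "checkpoint-400": 0.912,
    "checkpoint-800": 0.931,
    "checkpoint-1200": 0.926,
}
best = max(dev_acc, key=dev_acc.get)
print(best)  # checkpoint-800 under these illustrative numbers
```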


atreyasha avatar atreyasha commented on July 30, 2024

Hi @JunjieHu. Thank you, this makes sense now. Closing the issue.

