Comments (5)
Hi,
Have you been fine-tuning XLM-R, or which model have you been fine-tuning, to achieve a comparatively low performance on PAWS-X?
We are not currently planning to release fine-tuned models as this would mean we would need to release one model for each task. Even though fine-tuned models may be helpful to some extent for certain downstream tasks (see for instance this recent paper) we believe that the original pre-trained models should generally be used as the starting point for further experiments.
from xtreme.
Hi @sebastianruder, thank you for the quick response.
Fine-tuning instances
As I just started fine-tuning recently, I have only managed to test a few instances. Essentially, I set up this repo (as per the readme) and ran the following commands with the corresponding best (train/dev/test) results:
bash scripts/train.sh bert-base-multilingual-cased pawsx
eval_results
acc_best = 0.9374687343671836
num_best = 1999
correct_best = 1874
acc_MaxLen128 = 0.9349674837418709
num_MaxLen128 = 1999
correct_MaxLen128 = 1869
...
eval_test_results
...
======= Predict using the model from checkpoint-400:
de=0.5757878939469735
en=0.5462731365682841
es=0.6068034017008505
fr=0.5927963981990996
ja=0.688344172086043
ko=0.742871435717859
zh=0.5957978989494748
total=0.6212391910240834
...
bash scripts/train.sh xlm-roberta-base pawsx
eval_results
acc_best = 0.9374687343671836
num_best = 1999
correct_best = 1874
acc_MaxLen128 = 0.9309654827413707
num_MaxLen128 = 1999
correct_MaxLen128 = 1861
...
eval_test_results
...
======= Predict using the model from checkpoint-2800:
de=0.575287643821911
en=0.545272636318159
es=0.5732866433216608
fr=0.5737868934467234
ja=0.6588294147073537
ko=0.7218609304652326
zh=0.6068034017008505
total=0.6078753662545558
...
bash scripts/train.sh xlm-roberta-large pawsx
I cut this training short because the training and dev losses were not changing and the test accuracy stayed fixed at 1.0, which seems off. Not sure why this was so.
Do you think these results were due to an "unlucky" initial configuration, and that I should re-run these commands a few more times? I believe there is still some stochasticity here with the optimizer despite the model being loaded from a fixed initial checkpoint.
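The stochasticity point above can be illustrated with a minimal stdlib-only sketch. This is a toy stand-in, not the repo's actual training code: the function and its accuracy numbers are hypothetical, and it only shows that seed-dependent randomness (data shuffling, dropout, the newly initialized classification head) makes runs diverge even from identical pre-trained weights, which is why averaging over several seeds is common practice.

```python
import random

def simulated_finetune_accuracy(seed):
    """Toy stand-in for one fine-tuning run: the final accuracy
    depends on the random seed even though the starting (pre-trained)
    weights are fixed. The numbers here are made up for illustration."""
    rng = random.Random(seed)
    base = 0.62  # hypothetical mean accuracy, not a real PAWS-X score
    return base + rng.uniform(-0.05, 0.05)

# Two runs with different seeds diverge...
run_a = simulated_finetune_accuracy(seed=1)
run_b = simulated_finetune_accuracy(seed=2)
assert run_a != run_b

# ...while re-running with the same seed reproduces the result exactly.
assert simulated_finetune_accuracy(seed=1) == run_a
```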
Fine-tuning models
We are not currently planning to release fine-tuned models as this would mean we would need to release one model for each task.
I understand. Hmm, would it be possible to share the hyperparameters and training arguments that were used to fine-tune the best-performing model so far for PAWS-X (e.g. those listed on the leaderboard)? I could then try to reproduce the best model with those values.
Just to add more information for this issue:
I ran bash scripts/train.sh xlm-roberta-large pawsx for XLM-R (large) again, but this time reduced the default learning rate from LR=2e-5 to LR=1e-6 in train_pawsx.sh (in contrast to the learning rate of 3e-5 mentioned in the XTREME paper, Appendix B). Based on this, here is a snippet of the current results in eval_test_results.
======= Predict using the model from checkpoint-200:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0
======= Predict using the model from checkpoint-400:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0
======= Predict using the model from checkpoint-600:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0
======= Predict using the model from checkpoint-800:
de=0.870935467733867
en=0.6323161580790395
es=0.7293646823411706
fr=0.840920460230115
ja=0.9354677338669335
ko=0.9459729864932466
zh=0.8099049524762382
total=0.8235546344600871
======= Predict using the model from checkpoint-1000:
de=0.6728364182091046
en=0.39419709854927465
es=0.542271135567784
fr=0.6188094047023511
ja=0.8344172086043021
ko=0.8659329664832416
zh=0.6953476738369184
total=0.6605445579932824
Observations and questions
At checkpoint-800, the highest test accuracy appears to have been achieved, and it is in the ballpark of the current highest score on the XTREME leaderboard for sentence classification. However, the model at checkpoint-800 would not have been selected as the best model, since the dev accuracies keep rising after that point.
Is there a reason why the first few checkpoints (above) all show an accuracy of 1.0? Strangely, in all the checkpoints where the test accuracy was 1.0, the dev accuracy was fixed at 0.568784392196098. I am not sure if this is an error or if the test sets were classified completely correctly.
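To make the selection behaviour concrete, here is a small sketch of the usual rule (pick the checkpoint with the highest dev accuracy, then report its test score). The test accuracies are taken from the eval_test_results snippet above; the dev accuracies for checkpoints 800 and 1000 are hypothetical placeholders, since only the fixed 0.5688 value for the early checkpoints is known from this run.

```python
# Dev accuracies: 0.5688 for the early checkpoints is from the run above;
# the values for 800 and 1000 are made up to illustrate "dev accuracy
# keeps rising past checkpoint-800".
dev_acc = {200: 0.5688, 400: 0.5688, 600: 0.5688, 800: 0.90, 1000: 0.92}
# Test accuracies copied (rounded) from the eval_test_results snippet.
test_acc = {200: 1.0, 400: 1.0, 600: 1.0, 800: 0.8236, 1000: 0.6605}

# Model selection looks only at dev accuracy, never at test accuracy,
# so checkpoint-800 loses out despite its higher test score.
best_ckpt = max(dev_acc, key=dev_acc.get)
print(best_ckpt, test_acc[best_ckpt])
```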
Hi @atreyasha, the test score is around 0.5 because we remove the true class label for each example in the test set and use a fake placeholder ("0") as the label for all test examples. We want to encourage users to generate their predictions and submit them to our XTREME benchmark, so removing the labels is intentional, to avoid trivial submissions. At the beginning of training, the model is not yet well trained and predicts zero for every example; that is why you observe a test accuracy of 1.0 at the beginning.
If you want to pick the best system for PAWS-X, you may refer to the dev set accuracy, which is accurate; we also find that scores on the dev and test sets are well correlated.
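The effect described above can be sketched in a few lines. This is an illustration of the scoring arithmetic, not the repo's actual evaluation code: with every test label replaced by the placeholder "0", an untrained all-zeros model scores 1.0 against the placeholders, while a model that actually predicts both classes scores roughly the fraction of zeros it outputs.

```python
# Fake labels shipped with the test set: a placeholder "0" for every example.
placeholder_labels = ["0"] * 8

def accuracy(preds, labels):
    """Fraction of predictions that match the (placeholder) labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

untrained_preds = ["0"] * 8     # early checkpoint: predicts zero everywhere
trained_preds = ["0", "1"] * 4  # later checkpoint: predicts both classes

assert accuracy(untrained_preds, placeholder_labels) == 1.0  # spurious 1.0
assert accuracy(trained_preds, placeholder_labels) == 0.5    # ~fraction of 0s
```

This matches the pattern in the logs: accuracy 1.0 at the first checkpoints, dropping toward chance-like values once the model starts predicting both classes.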
Hi @JunjieHu. Thank you, this makes sense now. Closing the issue.