
winogrande's Introduction

WinoGrande

Version 1.1


Data

Download the dataset by running download_winogrande.sh:

./data/
├── train_[xs,s,m,l,xl].jsonl          # training sets of different sizes
├── train_[xs,s,m,l,xl]-labels.lst     # answer labels for training sets
├── dev.jsonl                          # development set
├── dev-labels.lst                     # answer labels for development set
├── test.jsonl                         # test set
├── sample-submissions-labels.lst      # example submission file for leaderboard    
└── eval.py                            # evaluation script

You can use train_*.jsonl for training models and dev.jsonl for validation. Please note that labels are not included in test.jsonl. To evaluate your models on the test set, make a submission to our leaderboard.
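Each *.jsonl file has one JSON object per line, and each *-labels.lst file has one gold label (1 or 2) per line, aligned with the corresponding *.jsonl file. A minimal loading sketch (not part of this repo; the field names sentence, option1, and option2 are assumptions based on the released data format):

    import json

    def load_split(jsonl_path, labels_path=None):
        # One JSON object per line in the .jsonl file.
        with open(jsonl_path) as f:
            examples = [json.loads(line) for line in f]
        labels = None
        if labels_path is not None:
            # One gold label ("1" or "2") per line, aligned with the examples.
            with open(labels_path) as f:
                labels = [line.strip() for line in f]
        return examples, labels

    examples, labels = load_split("./data/train_xs.jsonl", "./data/train_xs-labels.lst")
    print(examples[0]["sentence"], examples[0]["option1"], examples[0]["option2"], labels[0])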

Run experiments

Setup

  1. Download the dataset by running download_winogrande.sh
  2. pip install -r requirements.txt

Training (fine-tuning)

  1. You can train your model with ./scripts/run_experiment.py (see sample_training.sh).

     e.g., 
     export PYTHONPATH=$PYTHONPATH:$(pwd)
    
     python scripts/run_experiment.py \
     --model_type roberta_mc \
     --model_name_or_path roberta-large \
     --task_name winogrande \
     --do_eval \
     --do_lower_case \
     --data_dir ./data \
     --max_seq_length 80 \
     --per_gpu_eval_batch_size 4 \
     --per_gpu_train_batch_size 16 \
     --learning_rate 1e-5 \
     --num_train_epochs 3 \
     --output_dir ./output/models/ \
     --do_train \
     --logging_steps 4752 \
     --save_steps 4750 \
     --seed 42 \
     --data_cache_dir ./output/cache/ \
     --warmup_pct 0.1 \
     --evaluate_during_training
    
  2. If you have access to Beaker, you can run your experiments with sh ./train_winogrande_on_bkr.sh.

  3. Results will be stored under ./output/models/.

Prediction (on the test set)

  1. You can make predictions with ./scripts/run_experiment.py directly (see sample_prediction.sh).

     e.g., 
     export PYTHONPATH=$PYTHONPATH:$(pwd)
    
     python scripts/run_experiment.py \
     --model_type roberta_mc \
     --model_name_or_path ./output/models \
     --task_name winogrande \
     --do_predict \
     --do_lower_case \
     --data_dir ./data \
     --max_seq_length 80 \
     --per_gpu_eval_batch_size 4 \
     --output_dir ./output/models/ \
     --data_cache_dir ./output/cache/
    
  2. If you have access to Beaker, you can run your experiments with sh ./predict_winogrande_on_bkr.sh.

  3. Results are stored in ./output/models/predictions_test.lst.

Evaluation

You can use eval.py for evaluation on the dev split, which yields metrics.json.

e.g., python eval.py --preds_file ./YOUR_PREDICTIONS.lst --labels_file ./dev-labels.lst

In the prediction file, each line contains the predictions (1 or 2) made by models trained on the five training sets (ordered xs, s, m, l, xl and separated by commas) for one evaluation-set question.

 2,1,1,1,1
 1,1,2,2,2
 1,1,1,1,1
 .........
 .........

Namely, the first column contains the predictions of a model trained/fine-tuned on train_xs.jsonl, the second those of a model trained on train_s.jsonl, ..., and the last (fifth) column those of a model trained on train_xl.jsonl. Please check out the sample submission file (sample-submissions-labels.lst) for reference.
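If you train five separate models (one per training-set size), you will have five one-column prediction files that need to be merged into the five-column format above. A minimal merging sketch (not part of this repo; the per-split file names are hypothetical):

    # Hypothetical per-split prediction files, one "1"/"2" per line, in xs..xl order.
    split_files = [
        "predictions_xs.lst", "predictions_s.lst", "predictions_m.lst",
        "predictions_l.lst", "predictions_xl.lst",
    ]

    columns = []
    for path in split_files:
        with open(path) as f:
            columns.append([line.strip() for line in f])

    # One comma-separated row per evaluation question, columns ordered xs, s, m, l, xl.
    with open("predictions.lst", "w") as out:
        for row in zip(*columns):
            out.write(",".join(row) + "\n")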

Submission to Leaderboard

You can submit your predictions on the test set to the leaderboard. The submission file must be named predictions.lst. The format is the same as above.

Reference

If you use this dataset, please cite the following paper:

@article{sakaguchi2019winogrande,
    title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
    author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
    journal={arXiv preprint arXiv:1907.10641},
    year={2019}
}

License

Winogrande (codebase) is licensed under the Apache License 2.0. The dataset is licensed under CC-BY.

Questions?

Please file GitHub issues with your questions/suggestions. You may also ask us questions at our Google group.

Contact

Email: keisukes[at]allenai.org


winogrande's Issues

per_gpu_eval_batch_size

Hi,
I followed requirements.txt plus torch==1.2.0 and fine-tuned a RoBERTa-large on train_m.jsonl with sample_prediction.sh (the only change is --max_seq_length 50).
I found that changing the value of per_gpu_eval_batch_size during inference (--do_eval) somehow leads to different dev results.
Specifically, when per_gpu_eval_batch_size <= 10, I got acc_dev = 0.5082872928176796; otherwise, I got 0.5090765588003157.
Since model.eval() and torch.no_grad() are already used during inference,
I wonder where this inconsistency comes from.

Thanks
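One plausible source of such small differences (not confirmed by the maintainers) is floating-point non-associativity: a different eval batch size changes how inputs are padded and how activations are reduced, so sums are accumulated in a different order and can differ in the last decimal places even under model.eval() and torch.no_grad(). A toy illustration:

    import torch

    torch.manual_seed(0)
    x = torch.randn(10000, dtype=torch.float32)

    # Summing the same numbers in different groupings (as different batch sizes
    # effectively do inside the model) can give slightly different float32 results.
    print(x.sum().item())
    print(sum(chunk.sum() for chunk in x.split(10)).item())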

hyperparameters for results described in paper

Hi, what are the specific hyperparameters for the results achieved in the paper (79.3 for RoBERTa and 65.8 for BERT)?
I saw the mention of the grid search, but the exact values leading to the results were not given. Thank you!

Reproducing the result on Winogrande

Hi there!

I am trying to reproduce the results reported in WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale, namely the score of RoBERTa trained and evaluated on WinoGrande. Right now I am dissatisfied with the score I get, and I suspect the issue is how I train the model. Do I understand correctly that the training objective for WinoGrande was multiple choice between two options based on the sentence with _? So, is it correct that the input to RoBERTa is a plain sentence with _ and two options?

Also, would you mind sharing training details such as maximum sequence length and best hyperparameters found by grid-search?

Thanks!

Error in the description of context and option in paper?

Hello!

I'm trying to replicate your paper and I'm wondering if there is an error in the description of the sentence split:

This is the original description from the paper:

The input format becomes [CLS] context [SEP] option [SEP]; e.g., The trophy doesn’t fit into the brown suitcase because the ___ [SEP] is too large. [SEP] (The blank is filled with either option 1 or 2)

However, if I understand correctly, it should be

The trophy doesn’t fit into the brown suitcase because the [SEP] ___ is too large.

sentence = "Sarah was a much better surgeon than Maria so _ always got the easier cases"
conj = "_"
idx = sentence.index(conj)                               # locate the blank
context = sentence[:idx]                                 # text before the blank
option_str = "_ " + sentence[idx + len(conj):].strip()   # the blank plus the remainder
print(context)
print(option_str)

Sarah was a much better surgeon than Maria so
_ always got the easier cases
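Extending that split, a rough sketch (an assumption about the pipeline, not the repo's exact code) of how the two multiple-choice inputs could then be assembled in the "[CLS] context [SEP] option [SEP]" format quoted above, with the blank filled by each candidate:

    sentence = "Sarah was a much better surgeon than Maria so _ always got the easier cases"
    option1, option2 = "Sarah", "Maria"

    idx = sentence.index("_")
    context = sentence[:idx]                        # text before the blank
    option_str = "_ " + sentence[idx + 1:].strip()  # the blank plus the remainder

    # Fill the blank with each candidate to obtain the two option sequences.
    pairs = [(context, option_str.replace("_", cand, 1)) for cand in (option1, option2)]

    for ctx, opt in pairs:
        print(f"[CLS] {ctx}[SEP] {opt} [SEP]")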

Fantastic efforts by the way!

Xiaoou

Reproducing Transfer-Learning from WinoGrande to WSC

Hi,

I'm trying to reproduce the results from the paper WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale, more specifically Section 6 (Using WINOGRANDE as a Resource), where 77.6 accuracy is achieved on the WSC task by a model fine-tuned on WinoGrande.

So far I tried running sample_training.sh and then feeding the output model to the sample_training_superglue-wsc.sh script. I got the following error:

RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification:
  size mismatch for classifier.out_proj.weight: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([2, 1024]).
  size mismatch for classifier.out_proj.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).

Evidently, the classification heads are different for these two tasks. I was wondering whether it is possible to load the model without the classification head using the code in the repository. Can you help me with that?

Also, would you mind sharing the hyperparameters used to achieve the results from the paper?
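Regarding loading the WinoGrande-finetuned checkpoint without its classification head, one common workaround (not the authors' documented method, and sketched against the current transformers API rather than the older pytorch-transformers pinned by this repo, so imports may need adapting) is to drop the classifier.* weights and let a fresh 2-way head be initialized:

    import torch
    from transformers import RobertaConfig, RobertaForSequenceClassification

    # Load the fine-tuned weights and drop the mismatched 1-way classification head.
    state_dict = torch.load("./output/models/pytorch_model.bin", map_location="cpu")
    state_dict = {k: v for k, v in state_dict.items() if not k.startswith("classifier.")}

    # Build a 2-way model; the missing classifier.* weights stay freshly initialized.
    config = RobertaConfig.from_pretrained("roberta-large", num_labels=2)
    model = RobertaForSequenceClassification.from_pretrained("roberta-large", config=config)
    model.load_state_dict(state_dict, strict=False)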
