reStructured Pretraining (All are in the Paper)
- 2022-09-04: release all models at Huggingface
- 2022-08-15: release all data at DataLab
Technology tends to evolve in a direction that lets system developers build better, more general systems while doing less work themselves.
reStructured Pre-training (RST) treats model pre-training and tuning as a process of storing and accessing data, and holds that a good storage mechanism should make the expected data easy to access.
This dataset is a precious treasure, containing a variety of naturally occurring signals. Any downstream task you can think of (e.g., the college entrance exam mentioned in the RST paper) can benefit from pre-training on some of the signals we provide. We spent several months collecting the following 29 signal types, for a total of 46,926,447 data samples. We hope this dataset will be a valuable asset for everyone in natural language processing research.
We provide the collected signals through DataLab. For efficiency, we provide at most 50,000 samples for each signal type. If you want all the samples we collected, please fill out this form. More specifically, we collected the following signals.
For example, to load one signal:
```python
# pip install datalabs
from datalabs import load_dataset
rst = load_dataset("rst", "wikipedia_entities")
```
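The 50,000-sample cap can be made concrete with a little arithmetic. The sketch below is illustrative only: the counts are copied from the signal table in this README, the helper name `samples_via_datalab` is hypothetical, and it assumes the cap applies to each signal type as a whole.

```python
# Sample counts copied from the signal table in this README.
TOTAL_SAMPLES = {
    "rotten_tomatoes_sentiment": 5_311_109,
    "wikihow_category_hierarchy": 4_868,
    "wikipedia_entities": 22_231_011,
    "wordnet_antonym": 6_408,
}

# "at most 50,000 samples for each signal type"
DATALAB_CAP = 50_000


def samples_via_datalab(total: int, cap: int = DATALAB_CAP) -> int:
    """Upper bound on the samples obtainable through DataLab for one signal.

    Assumption: the cap is applied per signal type, not per split.
    """
    return min(total, cap)


for name, total in TOTAL_SAMPLES.items():
    available = samples_via_datalab(total)
    print(f"{name}: at most {available:,} of {total:,} samples")
```

Under this assumption, small signals such as `wikihow_category_hierarchy` are available in full, while the large ones require the request form for the complete data.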
We would be happy 😃 to hear if this resource is helpful for your work. If it is, please cite our paper 😊
Mine | Signal | #Sample | Use in DataLab | Some Applications |
---|---|---|---|---|
Rotten Tomatoes | (review, rating) | 5,311,109 | `load_dataset("rst", "rotten_tomatoes_sentiment")` | Sentiment classification |
Daily Mail | (text, category) | 899,904 | `load_dataset("rst", "daily_mail_category")` | Topic classification |
Daily Mail | (title, text, summary) | 1,026,616 | `load_dataset("rst", "daily_mail_summary")` | Summarization; Sentence expansion |
Daily Mail | (text, events) | 1,006,412 | `load_dataset("rst", "daily_mail_temporal")` | Temporal reasoning |
Wikidata | (entity, entity_type, text) | 2,214,274 | `load_dataset("rst", "wikidata_entity")` | Entity typing |
Wikidata | (subject, object, relation, text) | 1,526,674 | `load_dataset("rst", "wikidata_relation")` | Relation extraction; Fact retrieval |
wikiHow | (text, category) | 112,109 | `load_dataset("rst", "wikihow_text_category")` | Topic classification |
wikiHow | (low_category, high_category) | 4,868 | `load_dataset("rst", "wikihow_category_hierarchy")` | Relation extraction; Commonsense reasoning |
wikiHow | (goal, steps) | 47,956 | `load_dataset("rst", "wikihow_goal_step")` | Intent detection |
wikiHow | (text, summary) | 703,278 | `load_dataset("rst", "wikihow_summary")` | Summarization; Sentence expansion |
wikiHow | (goal, first_step, second_step) | 47,787 | `load_dataset("rst", "wikihow_procedure")` | Temporal reasoning |
wikiHow | (question, description, answer, related_questions) | 47,705 | `load_dataset("rst", "wikihow_question")` | Question generation |
Wikipedia | (text, entities) | 22,231,011 | `load_dataset("rst", "wikipedia_entities")` | Entity recognition |
Wikipedia | (texts, titles) | 3,296,225 | `load_dataset("rst", "wikipedia_sections")` | Summarization |
WordNet | (word, sentence, pos) | 27,123 | `load_dataset("rst", "wordnet_pos")` | Part-of-speech tagging |
WordNet | (word, sentence, meaning, possible_meanings) | 27,123 | `load_dataset("rst", "wordnet_meaning")` | Word sense disambiguation |
WordNet | (word, sentence, synonyms) | 17,804 | `load_dataset("rst", "wordnet_synonym")` | Paraphrasing |
WordNet | (word, sentence, antonyms) | 6,408 | `load_dataset("rst", "wordnet_antonym")` | Negation |
ConTRoL | (premise, hypothesis, label) | 8,323 | `load_dataset("rst", "qa_control")` | Natural language inference |
DREAM | (context, question, options, answer) | 9,164 | `load_dataset("rst", "qa_dream")` | Reading comprehension |
LogiQA | (context, question, options, answer) | 7,974 | `load_dataset("rst", "qa_logiqa")` | Reading comprehension |
ReClor | (context, question, options, answer) | 5,138 | `load_dataset("rst", "qa_reclor")` | Reading comprehension |
RACE | (context, question, options, answer) | 44,880 | `load_dataset("rst", "qa_race")` | Reading comprehension |
RACE-C | (context, question, options, answer) | 5,093 | `load_dataset("rst", "qa_race_c")` | Reading comprehension |
TriviaQA | (context, question, answer) | 46,636 | `load_dataset("rst", "qa_triviaqa")` | Reading comprehension |
Arxiv | (text, category) | 1,696,348 | `load_dataset("rst", "arxiv_category")` | Topic classification |
Arxiv | (text, summary) | 1,696,348 | `load_dataset("rst", "arxiv_summary")` | Summarization; Sentence expansion |
Paperswithcode | (text, entities, datasets, methods, tasks, metrics) | 4,731,233 | `load_dataset("rst", "paperswithcode_entity")` | Entity recognition |
Paperswithcode | (text, summary) | 120,924 | `load_dataset("rst", "paperswithcode_summary")` | Summarization; Sentence expansion |
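To illustrate how a signal tuple from the table might be restructured into a text-to-text training pair, here is a minimal sketch. The function name `restructure_signal` and the sample review are hypothetical, and the "TEXT: ... QUERY: ..." layout is borrowed from the inference prompt shown later in this README; the templates actually used for pre-training may differ.

```python
from typing import Dict


def restructure_signal(text: str, query: str, answer: str) -> Dict[str, str]:
    """Turn one (text, label)-style signal into a source/target pair.

    Assumption: the source follows the "TEXT: ... QUERY: ..." layout of the
    inference prompt in this README; this is not confirmed as the
    pre-training template.
    """
    return {"source": f"TEXT: {text} QUERY: {query}", "target": answer}


# Hypothetical (review, rating) sample in the style of the Rotten Tomatoes signal.
example = restructure_signal(
    text="A warm, funny film with a terrific cast.",
    query='Is this review "positive" or "negative"',
    answer="positive",
)
print(example["source"])
print(example["target"])
```

The same helper could serve other (text, label) signals, such as (text, category), by changing only the query string.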
Check all models at Huggingface
We release all models introduced in our paper, covering 13 different application scenarios. Each model contains 11 billion parameters.
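For planning hardware, it can help to estimate how much memory the weights of an 11-billion-parameter model occupy. The sketch below is a back-of-the-envelope calculation, not a measurement: it treats the parameter count as exactly 11 billion, counts weights only (no activations or generation cache), and the helper name `weight_memory_gb` is hypothetical.

```python
# "Each model contains 11 billion parameters" (treated as exact here).
NUM_PARAMS = 11_000_000_000

# Storage size of one parameter for common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(dtype: str, num_params: int = NUM_PARAMS) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {weight_memory_gb(dtype):.0f} GB")
```

Under these assumptions, full-precision weights need roughly 44 GB and half-precision weights roughly 22 GB, before any runtime overhead.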
Model | Description | Recommended Application |
---|---|---|
rst-all-11b | Trained with all the signals below except signals that are used to train Gaokao models | All applications below (specialized models are recommended first if high performance is preferred) |
rst-fact-retrieval-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym, wikiHow category hierarchy, Wikidata relation, Wikidata entity typing, Paperswithcode entity typing | Knowledge-intensive tasks, information extraction tasks, fact checking |
rst-summarization-11b | Trained with the following signals: DailyMail summary, Paperswithcode summary, arXiv summary, wikiHow summary | Summarization or other general generation tasks, meta-evaluation (e.g., BARTScore) |
rst-temporal-reasoning-11b | Trained with the following signals: DailyMail temporal information, wikiHow procedure | Temporal reasoning, relation extraction, event-based extraction |
rst-information-extraction-11b | Trained with the following signals: Paperswithcode entity, Paperswithcode entity typing, Wikidata entity typing, Wikidata relation, Wikipedia entity | Named entity recognition, relation extraction and other general IE tasks in the news, scientific or other domains |
rst-intent-detection-11b | Trained with the following signals: wikiHow goal-step relation | Intent prediction, event prediction |
rst-topic-classification-11b | Trained with the following signals: DailyMail category, arXiv category, wikiHow text category, Wikipedia section title | General text classification |
rst-word-sense-disambiguation-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym | Word sense disambiguation, part-of-speech tagging, general IE tasks, common sense reasoning |
rst-natural-language-inference-11b | Trained with the following signals: ConTRoL dataset, DREAM dataset, LogiQA dataset, RACE & RACE-C dataset, ReClor dataset, DailyMail temporal information | Natural language inference, multiple-choice question answering, reasoning |
rst-sentiment-classification-11b | Trained with the following signals: Rotten Tomatoes sentiment, Wikipedia sentiment | Sentiment classification, emotion classification |
rst-gaokao-rc-11b | Trained with multiple-choice QA datasets that are used to train the T0pp model | General multiple-choice question answering |
rst-gaokao-cloze-11b | Trained with manually crafted cloze datasets | General cloze filling |
rst-gaokao-writing-11b | Trained with example essays from past Gaokao-English exams and grammar error correction signals | Essay writing, story generation, grammar error correction and other text generation tasks |
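The table can be read as a lookup from application to model. A minimal sketch of that lookup follows; the helper `checkpoint_for` is hypothetical, and it assumes the specialized models are hosted under the same `XLab` namespace as `rst-all-11b`, which this README only confirms for the general model.

```python
# Model names copied from the table above, keyed by application.
RST_CHECKPOINTS = {
    "fact retrieval": "rst-fact-retrieval-11b",
    "summarization": "rst-summarization-11b",
    "temporal reasoning": "rst-temporal-reasoning-11b",
    "information extraction": "rst-information-extraction-11b",
    "intent detection": "rst-intent-detection-11b",
    "topic classification": "rst-topic-classification-11b",
    "word sense disambiguation": "rst-word-sense-disambiguation-11b",
    "natural language inference": "rst-natural-language-inference-11b",
    "sentiment classification": "rst-sentiment-classification-11b",
}

# The table recommends rst-all-11b for all applications as a general option.
GENERAL_CHECKPOINT = "rst-all-11b"


def checkpoint_for(task: str, namespace: str = "XLab") -> str:
    """Return a model identifier for a task, falling back to the general model.

    Assumption: every model lives under the same namespace as rst-all-11b.
    """
    name = RST_CHECKPOINTS.get(task.strip().lower(), GENERAL_CHECKPOINT)
    return f"{namespace}/{name}"


print(checkpoint_for("Summarization"))
print(checkpoint_for("some other task"))
```

Preferring a specialized model and falling back to the general one mirrors the table's note that specialized models are recommended when high performance matters.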
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the sequence-to-sequence model.
tokenizer = AutoTokenizer.from_pretrained("XLab/rst-all-11b")
model = AutoModelForSeq2SeqLM.from_pretrained("XLab/rst-all-11b")
# The prompt pairs the input (TEXT) with a natural-language question (QUERY).
inputs = tokenizer.encode("TEXT: this is the best cast iron skillet you will ever buy. QUERY: Is this review \"positive\" or \"negative\"", return_tensors="pt")
# Generate the answer tokens and decode them back to plain text.
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
```
See more detail
@article{yuan2022restructured,
title={reStructured Pre-training},
author={Yuan, Weizhe and Liu, Pengfei},
journal={arXiv preprint arXiv:2206.11147},
year={2022}
}