
WikiReading

This repository contains the three WikiReading datasets as used and described in WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia, Hewlett, et al, ACL 2016 (the English WikiReading dataset) and Byte-level Machine Reading across Morphologically Varied Languages, Kenter et al, AAAI-18 (the Turkish and Russian datasets).

Run get_data.sh to download the English WikiReading dataset.

Run get_ru_data.sh and get_tr_data.sh to download the Russian and Turkish versions of the WikiReading data, respectively.

If you use the data or the results reported in the papers, please cite them:

@inproceedings{hewlett2016wikireading,
  title = {{WIKIREADING}: A Novel Large-scale Language Understanding Task over {Wikipedia}},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)},
  author = {Daniel Hewlett and Alexandre Lacoste and Llion Jones and Illia Polosukhin and Andrew Fandrianto and Jay Han and Matthew Kelcey and David Berthelot},
  year = {2016}
}

and

@inproceedings{byte-level2018kenter,
  title = {Byte-level Machine Reading across Morphologically Varied Languages},
  author = {Tom Kenter and Llion Jones and Daniel Hewlett},
  booktitle = {Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)},
  year = {2018}
}

WikiReading Data

The train, validation, and test datasets are available as TFRecord files or as streamed JSON (one JSON object per line). They are 45 GB, 5 GB, and 3 GB, respectively. Each split is sharded into 15 files whose union is the whole split (for example, test.tar.gz contains 15 files covering the entire test set); the sharding speeds up training and testing by allowing parallel reads. Any shard can be opened with a TFRecordReader, or with any JSON reader applied line by line. If disk space is limited, download a sample TFRecord shard or a sample JSON shard of the validation set (1/15th of the split) to experiment with.
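Each line of a JSON shard is one complete example, so a shard can be streamed with any line-oriented JSON reader. A minimal Python sketch (the field names come from the feature table later in this README; the sample record itself is fabricated for illustration):

```python
import json

def read_json_shard(path):
    """Yield one example per line from an extracted JSON shard."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Stand-in for one line of a shard. The field names match the feature
# table in this README, but the values are made up for this example.
line = json.dumps({
    "question_string_sequence": ["date", "of", "birth"],
    "answer_string_sequence": ["4", "May", "1655"],
})

example = json.loads(line)
question = " ".join(example["question_string_sequence"])
answer = " ".join(example["answer_string_sequence"])
print(question)  # date of birth
print(answer)    # 4 May 1655
```

The same loop works unchanged on any extracted shard, since every shard is a plain newline-delimited JSON file.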

English

| file | size | download / description |
| --- | --- | --- |
| train | 16,039,400 examples | TFRecord: https://storage.googleapis.com/wikireading/train.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/train.json.tar.gz |
| validation | 1,886,798 examples | TFRecord: https://storage.googleapis.com/wikireading/validation.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/validation.json.tar.gz |
| test | 941,280 examples | TFRecord: https://storage.googleapis.com/wikireading/test.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/test.json.tar.gz |
| document.vocab | 176,978 tokens | vocabulary for tokens from Wikipedia documents |
| answer.vocab | 10,876 tokens | vocabulary for tokens from answers |
| raw_answer.vocab | 1,359,244 tokens | vocabulary for whole answers as they appear in WikiData |
| type.vocab | 80 tokens | vocabulary for part-of-speech tags |
| character.vocab | 12,486 tokens | vocabulary for all characters that appear in the string sequences |

Russian

| file | size | download / description |
| --- | --- | --- |
| train | 4,259,667 examples | TFRecord: https://storage.googleapis.com/wikireading/ru/train.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/ru/train.json.tar.gz |
| validation | 531,412 examples | TFRecord: https://storage.googleapis.com/wikireading/ru/valid.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/ru/valid.json.tar.gz |
| test | 533,026 examples | TFRecord: https://storage.googleapis.com/wikireading/ru/test.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/ru/test.json.tar.gz |
| document.vocab | 965,157 tokens | vocabulary for tokens from Wikipedia documents |
| answer.vocab | 57,952 tokens | vocabulary for tokens from answers |
| type.vocab | 56 tokens | vocabulary for part-of-speech tags |
| character.vocab | 12,205 tokens | vocabulary for all characters that appear in the string sequences |

Turkish

| file | size | download / description |
| --- | --- | --- |
| train | 654,705 examples | TFRecord: https://storage.googleapis.com/wikireading/tr/train.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/tr/train.json.tar.gz |
| validation | 81,622 examples | TFRecord: https://storage.googleapis.com/wikireading/tr/valid.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/tr/valid.json.tar.gz |
| test | 82,643 examples | TFRecord: https://storage.googleapis.com/wikireading/tr/test.tar.gz<br>JSON: https://storage.googleapis.com/wikireading/tr/test.json.tar.gz |
| document.vocab | 215,294 tokens | vocabulary for tokens from Wikipedia documents |
| answer.vocab | 11,123 tokens | vocabulary for tokens from answers |
| type.vocab | 10 tokens | vocabulary for part-of-speech tags |
| character.vocab | 6,638 tokens | vocabulary for all characters that appear in the string sequences |

Features

Each instance contains the following features (some features may be empty):

| feature name | description |
| --- | --- |
| answer_breaks | Indices into answer_ids and answer_string_sequence, used to delimit multiple answers to a question (e.g., a list answer). |
| answer_ids | answer.vocab ID sequence for words in the answer. |
| answer_location | Word indices into the document where any one token of the answer was found. |
| answer_sequence | document.vocab ID sequence for words in the answer. |
| answer_string_sequence | String sequence for the words in the answer. |
| break_levels | One integer in [0, 4] per word in the document, indicating a break level:<br>0 = no separation between tokens<br>1 = tokens separated by a space<br>2 = tokens separated by a line break<br>3 = tokens separated by a sentence break<br>4 = tokens separated by a paragraph break |
| document_sequence | document.vocab ID sequence for words in the document. |
| full_match_answer_location | Word indices into the document where all contiguous tokens of the answer were found. |
| paragraph_breaks | Word indices into the document indicating paragraph boundaries. |
| question_sequence | document.vocab ID sequence for words in the question. |
| question_string_sequence | String sequence for the words in the question. |
| raw_answer_ids | raw_answer.vocab ID for the answer. |
| raw_answers | A string containing the raw answer. |
| sentence_breaks | Word indices into the document indicating sentence boundaries. |
| string_sequence | String sequence for the words in the document (see character.vocab for character IDs). |
| type_sequence | type.vocab ID sequence for tags (POS, type, etc.) in the document. |
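To illustrate how the break-related features might be consumed, here is a hedged Python sketch. The data is fabricated, and two conventions are assumptions rather than documented guarantees: that each break level applies immediately before the corresponding word, and that each index in answer_breaks marks the first token of a new answer.

```python
# Separator for each break level, per the break_levels feature above.
SEPARATORS = {0: "", 1: " ", 2: "\n", 3: " ", 4: "\n\n"}

def detokenize(words, levels):
    """Rebuild document text from string_sequence and break_levels.

    Assumption: levels[i] is the break level immediately before words[i].
    """
    parts = []
    for word, level in zip(words, levels):
        parts.append(SEPARATORS[level])
        parts.append(word)
    return "".join(parts)

def split_answers(tokens, breaks):
    """Split answer_string_sequence into separate answers at answer_breaks.

    Assumption: each index in `breaks` marks the first token of a new answer.
    """
    bounds = [0] + list(breaks) + [len(tokens)]
    return [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]

print(detokenize(["Hello", ",", "world"], [0, 0, 1]))   # Hello, world
print(split_answers(["red", "green", "blue"], [1, 2]))  # ['red', 'green', 'blue']
```

If the shards turn out to use a different break convention (e.g., breaks after each word, or exclusive end indices), only the indexing in these two helpers needs to change.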
