kelm-corpus's Introduction

For details describing the origin of this dataset, please refer to "Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training", Oshin Agarwal, Heming Ge, Siamak Shakeri, Rami Al-Rfou.

This corpus consists of two parts: the TEKGEN (Text From KG Generation) training corpus and the generated synthetic KELM (Knowledge Enhanced Language Model Pre-training) corpus.

Part 1: TEKGEN Training Corpus

This is the Wikipedia text--Wikidata KG aligned corpus used to train the data-to-text generation model. Please note that this corpus was generated with distant supervision and should not be used as a gold standard for evaluation.

It consists of 3 files:

  1. https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/quadruples-train.tsv
  2. https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/quadruples-validation.tsv
  3. https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/quadruples-test.tsv

Each file contains one example per line. Each example is a JSON object with three fields:

  1. triples: A list of triples of the form (subject, relation, object), e.g. (Person X, award received, Award Y). If the triple has a subproperty, it is a quadruple instead, e.g. (Person X, Award Y, received on, Date Z).

  2. serialized triples: triples concatenated together as used for input to T5. The format is "<subject> <relation> <object>" where some subjects have multiple relations, e.g. "<subject> <relation1> <object1> <relation2> <object2> <relation3> <object3>". For more details on how these relations are grouped, please refer to the paper.

  3. sentence: The Wikipedia sentence aligned to these triples.
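
As a rough illustration of the serialized format described above, the following sketch groups plain (subject, relation, object) triples by subject and joins them with spaces. It is not the authors' code: the function name serialize_triples is hypothetical, the space-only separator is an assumption taken from the format string above, and quadruples (triples with a subproperty) are deliberately not handled because their serialization is only described in the paper.

```python
def serialize_triples(triples):
    """Group 3-element triples by subject and join each group as
    'subject relation1 object1 relation2 object2 ...'.

    Assumption: tokens are separated by single spaces. Quadruples are
    not supported by this sketch and raise a ValueError.
    """
    grouped = {}  # dicts preserve insertion order in Python 3.7+
    for triple in triples:
        if len(triple) != 3:
            raise ValueError("this sketch only handles (subject, relation, object)")
        subject, relation, obj = triple
        grouped.setdefault(subject, []).append((relation, obj))

    serialized = []
    for subject, pairs in grouped.items():
        tokens = [subject]
        for relation, obj in pairs:
            tokens.extend([relation, obj])
        serialized.append(" ".join(tokens))
    return serialized


if __name__ == "__main__":
    example = [
        ("Person X", "award received", "Award Y"),
        ("Person X", "occupation", "Writer"),
    ]
    for line in serialize_triples(example):
        print(line)
```

The sketch returns one string per subject. Whether the released files store several subjects in a single string is not stated here, so treat the grouping as illustrative only.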

The names, aliases and Wikidata IDs of the entities can be found in https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/entities.jsonl.

Part 2: KELM Corpus

This is a synthetic corpus that consists of the entire Wikidata KG as natural text sentences. It has ~15M sentences synthetically generated using a T5 model fine-tuned on the data from Part 1 with some additional components. It can be used as additional data in language model pre-training as a means to integrate KGs with natural text.

https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/kelm_generated_corpus.jsonl

Each line is an example as a JSON object with three fields:

  1. triples: A list of triples of the form (subject, relation, object), e.g. (Person X, award received, Award Y). If the triple has a subproperty, it is a quadruple instead, e.g. (Person X, Award Y, received on, Date Z). These triples are entity subgraphs as described in the paper.

  2. serialized triples: triples concatenated together as used for input to T5. The format is "<subject> <relation> <object>" where some subjects have multiple relations, e.g. "<subject> <relation1> <object1> <relation2> <object2> <relation3> <object3>". For more details on how these relations are grouped, please refer to the paper.

  3. gen_sentence: The generated natural language sentence for the triples.

About 0.1% of the examples in kelm_generated_corpus.jsonl are missing the "triples" field.
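
Because some examples lack the "triples" field, a reader should not index that key directly. The sketch below parses one JSON object per line and substitutes an empty list when "triples" is absent. The helper name iter_examples is hypothetical, and the key "serialized_triples" (with an underscore) is an assumption about how the "serialized triples" field is spelled in the files; adjust it to match the actual data.

```python
import json


def iter_examples(lines):
    """Yield one dict per non-empty JSON line, tolerating missing fields.

    Assumption: the serialized field is stored under the key
    'serialized_triples'; check the actual files before relying on it.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        yield {
            # A small fraction of examples have no "triples" field.
            "triples": example.get("triples", []),
            "serialized_triples": example.get("serialized_triples", ""),
            "gen_sentence": example.get("gen_sentence", ""),
        }


if __name__ == "__main__":
    sample_lines = [
        '{"triples": [["Person X", "award received", "Award Y"]], '
        '"gen_sentence": "Person X received Award Y."}',
        '{"gen_sentence": "A sentence whose triples field is missing."}',
    ]
    for ex in iter_examples(sample_lines):
        print(len(ex["triples"]), ex["gen_sentence"])
```

To read the downloaded corpus, pass an open file handle, e.g. `with open("kelm_generated_corpus.jsonl", encoding="utf-8") as f: ...`, since file objects iterate line by line.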

The names, aliases and Wikidata IDs of the entities can be found in https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/entities.jsonl.

License

This dataset has been released under the CC BY-SA 2.0 license.
