
T-Rex : A Large Scale Alignment of Natural Language with Knowledge Base Triples

This repository contains the extraction framework for the T-REx dataset. More details can be found on the T-REx Website.

The paper was accepted at LREC 2018: link

@inproceedings{DBLP:conf/lrec/ElSaharVRGHLS18,
  author    = {Hady ElSahar and
               Pavlos Vougiouklis and
               Arslen Remaci and
               Christophe Gravier and
               Jonathon S. Hare and
               Fr{\'{e}}d{\'{e}}rique Laforest and
               Elena Simperl},
  title     = {T-REx: {A} Large Scale Alignment of Natural Language with Knowledge
               Base Triples},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources
               and Evaluation, {LREC} 2018, Miyazaki, Japan, May 7-12, 2018.},
  year      = {2018},
  crossref  = {DBLP:conf/lrec/2018},
  timestamp = {Fri, 18 May 2018 10:35:14 +0200},
  biburl    = {https://dblp.org/rec/bib/conf/lrec/ElSaharVRGHLS18},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Setup

For the English version of the dataset, run startup_multilang.sh

For a multilingual version of the dataset, run the script with the corresponding language code (es, eo and ar are supported), e.g. startup_multilang.sh es

To run the DBpedia Spotlight server on port 2222, run the following in a separate session:

cd dbpedia-spotlight
# the DBpedia Spotlight server needs at least 6 GB of RAM
java -Xmx6g -jar dbpedia-spotlight-latest.jar en http://localhost:2222/rest

Knowledge Base Dumps

DBpedia

DBpedia triples and sameAs links are downloaded automatically by the setup.sh script.

Wikidata

Downloaded automatically by the setup.sh script.

Wikidata provides a tool for exporting RDF dumps.

The simple RDF dumps were used, in which each statement is represented as a single triple and statements with qualifiers are omitted: Wikidata RDF dumps 20160801
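In the simple dumps, each statement is one subject-predicate-object line in N-Triples form. As a rough sketch of how such lines can be consumed downstream (the example line and the regex are illustrative, not part of this repository's code), a URI-only statement can be split like this:

```python
import re

# Matches one N-Triples line consisting of three URI terms and a
# terminating " .". Literal objects are not handled here; this sketch
# only covers entity-to-entity statements.
NT_LINE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s+\.')

def parse_triple(line):
    """Return (subject, predicate, object) URIs, or None if the line does not match."""
    m = NT_LINE.match(line.strip())
    return m.groups() if m else None

# Illustrative line, not taken from an actual dump.
line = ('<http://www.wikidata.org/entity/Q937> '
        '<http://www.wikidata.org/prop/direct/P19> '
        '<http://www.wikidata.org/entity/Q3012> .')
print(parse_triple(line))
```

Statements with qualifiers would not fit this three-term pattern, which is one reason the simple (qualifier-free) dumps are convenient to process line by line.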

sameAs links between Wikidata and DBpedia are already extracted and can be found on wikidata.dbpedia.org.

The latest version used in this project comes from the extraction page of 20150330.

The downloaded dump is available here: 20150330-sameas-all-wikis.ttl.bz2

Text Dumps

Wikipedia Articles dump

Go to ./datasets/Wikipedia/ and run setup.sh. The script will download the latest Wikipedia dump and extract the text of its articles.

DBpedia Abstracts dump

Go to ./datasets/wikipedia-abstracts/ and run setup.sh. The script will download the latest DBpedia abstracts dump from the DBpedia website and extract the text of each article.

Output Format

All modules in the pipeline take a single JSON file (described below) as input and output the same file after filling in some of its attributes.

  {
        "docid":                # document id -- Wikipedia document id when dealing with a Wikipedia dump
        "title":                # title of the Wikipedia document
        "uri":                  # URI of the item containing the main page
        "text":                 # the whole text of the document
        "sentences_boundaries": # list of (start, end) character offsets of each sentence
                                # [(start, end), (start, end), ...]
        "words_boundaries":     # list of (start, end) character offsets of each word in the article
        "entities":             # list of entities (class Entity)
                                [
                                {
                                "uri":          # URI of the entity
                                "boundaries":   # (start, end) character offsets of the entity's surface form
                                "surface-form": # surface form of the entity as it appears in the text
                                "annotator":    # name of the annotator used to detect this entity [NER, DBpediaspotlight, coref]
                                }
                                ]
        "triples":              # list of triples that occur in the document;
                                # we opt to keep them independent of the other fields so they are self-contained and easy to process
                                [
                                {
                                "subject":          # class Entity
                                "predicate":        # class Entity
                                "object":           # class Entity
                                "dependency_path":  # lexicalized dependency path between subject and object if it exists, else None
                                "confidence":       # confidence of the annotation, if available
                                "annotator":        # annotator used to align this triple with the sentence
                                "sentence_id":      # integer index of the sentence this triple lies in
                                }
                                ]
    }
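As a sketch of how these attributes fit together, the snippet below builds a minimal document in this shape and recovers the sentence supporting a triple from sentences_boundaries and sentence_id. All field values are made up for illustration and are not taken from the dataset; entity objects are abbreviated to their URIs.

```python
# Minimal illustrative document following the schema above.
# All values are fabricated for demonstration purposes.
doc = {
    "docid": "12345",
    "title": "Albert Einstein",
    "uri": "http://www.wikidata.org/entity/Q937",
    "text": "Albert Einstein was a physicist. He was born in Ulm.",
    "sentences_boundaries": [(0, 32), (33, 52)],
    "entities": [
        {"uri": "http://www.wikidata.org/entity/Q937",
         "boundaries": (0, 15),
         "surface-form": "Albert Einstein",
         "annotator": "DBpediaspotlight"},
    ],
    "triples": [
        {"subject": {"uri": "http://www.wikidata.org/entity/Q937"},
         "predicate": {"uri": "http://www.wikidata.org/prop/direct/P19"},
         "object": {"uri": "http://www.wikidata.org/entity/Q3012"},
         "dependency_path": None,
         "confidence": None,
         "annotator": "some-aligner",  # placeholder annotator name
         "sentence_id": 1},
    ],
}

def triple_sentence(doc, triple):
    """Return the text of the sentence a triple was aligned to,
    using sentence_id as an index into sentences_boundaries."""
    start, end = doc["sentences_boundaries"][triple["sentence_id"]]
    return doc["text"][start:end]

print(triple_sentence(doc, doc["triples"][0]))  # -> "He was born in Ulm."
```

Because all boundaries are character offsets into the single text field, any span (sentence, word, or entity surface form) can be recovered by slicing, which is what makes the triples self-contained and easy to process.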

Contributors

hadyelsahar, arslen, luciekaffee, pvougiou
