TAC-Entity-Linking

The entity-linking task consists of linking named-entity mentions in a document collection to the correct entity in a Knowledge Base (KB), which requires disambiguating between entities that share a name. For instance, the name string Washington in a document might refer to:

  • George Washington (1732–1799), first president of the United States
  • Washington (state), United States
  • Washington, D.C., the capital of the United States
  • NIL, in case the Washington that the document refers to does not exist in the KB

This repository contains the code for an entity-linking prototype, trained and tested with the datasets from the TAC Entity-Linking sub-task. A detailed description of the prototype system can be found in this report.

Data

Training Data

Each training query consists of:

  • name_string: the string representing the named entity, which has one of 3 possible types (GPE, PER, or ORG)
  • doc_id: the id of a support document where the named entity occurs
  • kb_id: the id of the correct entity in the Knowledge Base (Wikipedia)
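As a rough illustration, a training query could be held in a small record like the one below. This is a sketch, not the repository's own data model: the class name `TrainingQuery`, the separate `entity_type` field, the use of `None` for NIL, and the example ids are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# The three entity types named above.
ENTITY_TYPES = {"GPE", "PER", "ORG"}


@dataclass(frozen=True)
class TrainingQuery:
    """Hypothetical container for one training query."""
    name_string: str        # surface form of the named entity
    entity_type: str        # assumed to be stored alongside the name
    doc_id: str             # id of the support document
    kb_id: Optional[str]    # KB id of the correct entity; None stands for NIL here

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")

    @property
    def is_nil(self) -> bool:
        return self.kb_id is None


# Made-up ids, for illustration only.
query = TrainingQuery("Washington", "GPE", "doc_0001", "E0000001")
```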

Pre-Processing

  • Create a dictionary of alternative names based on:
    • Acronym expansion
    • Wikipedia redirect pages
  • The dictionary is kept in a Redis instance
  • Create 3 Lucene indexes:
    • KB/Wikipedia article names
    • KB/Wikipedia full text
    • Source document collection
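The alternative-names dictionary can be pictured as a mapping from a surface form to the set of names it may stand for. The sketch below is one plausible reading of the pre-processing step, not the prototype's code: it uses an in-memory dict in place of Redis, derives acronyms from the initial capital letters of article titles, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def acronym_of(name):
    """Join the first letters of the capitalised words in a name."""
    return "".join(word[0] for word in name.split() if word[0].isupper())


def build_alternative_names(article_titles, redirects):
    """Map each surface form to the set of names it may refer to.

    article_titles: iterable of KB/Wikipedia article titles
    redirects: dict mapping a redirect page title to its target title
    """
    alternatives = defaultdict(set)
    for title in article_titles:
        acronym = acronym_of(title)
        if len(acronym) >= 2:          # skip single-word titles
            alternatives[acronym].add(title)
    for source, target in redirects.items():
        alternatives[source].add(target)
    return alternatives


# Made-up sample data.
names = build_alternative_names(
    ["United Nations", "World Health Organization", "Lisbon"],
    {"UNO": "United Nations"},
)
```

In a deployment along the lines described above, each key would presumably be stored as a set in Redis rather than in a Python dict.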

Query Expansion

  • Get all possible alternative names/senses for a given query string
  • Extract the top-k articles, for each sense/name, from the KB/Wikipedia using the Lucene Index
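The two steps above can be sketched as follows. The `search` argument stands in for a query against the Lucene index of article names; `toy_search`, `TITLES`, and the ids are placeholders invented for the example so that it runs without Lucene.

```python
def expand_query(name_string, alternative_names):
    """Return the query string plus every known alternative name/sense."""
    senses = {name_string}
    senses.update(alternative_names.get(name_string, ()))
    return sorted(senses)


def retrieve_candidates(name_string, alternative_names, search, k=5):
    """Collect the top-k articles for each sense, without duplicates."""
    candidates, seen = [], set()
    for sense in expand_query(name_string, alternative_names):
        for article_id in search(sense, k):
            if article_id not in seen:
                seen.add(article_id)
                candidates.append(article_id)
    return candidates


# Placeholder index: article id -> title (made-up ids).
TITLES = {
    "E1": "United Nations",
    "E2": "United Nations Security Council",
    "E3": "Unix",
}


def toy_search(query, k):
    """Rank titles by number of shared tokens; a stand-in for Lucene."""
    query_tokens = set(query.lower().split())
    scored = []
    for article_id, title in TITLES.items():
        overlap = len(query_tokens & set(title.lower().split()))
        if overlap:
            scored.append((-overlap, article_id))
    return [article_id for _, article_id in sorted(scored)[:k]]


candidates = retrieve_candidates("UN", {"UN": {"United Nations"}}, toy_search)
```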

Candidates Generation

  • Extract features for each candidate instance retrieved from the KB
    • Topic Similarities (LDA)
    • String-Name Similarities
    • Textual Similarities
    • Graph Structure
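To make the string-name similarity group concrete, here is a minimal sketch of three such features. The specific measures (exact match, normalised edit distance, token Jaccard) are assumptions chosen for illustration; the list above does not say which measures the prototype computes.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution
            ))
        previous = current
    return previous[-1]


def token_jaccard(a, b):
    """Jaccard overlap between the lower-cased token sets of two names."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def name_features(query_name, candidate_title):
    """Hypothetical string-name feature vector for one candidate."""
    longest = max(len(query_name), len(candidate_title), 1)
    return {
        "exact_match": query_name.lower() == candidate_title.lower(),
        "edit_similarity": 1 - levenshtein(query_name, candidate_title) / longest,
        "token_jaccard": token_jaccard(query_name, candidate_title),
    }


features = name_features("Washington", "George Washington")
```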

Candidates Ranking

  • Pairwise learning to rank: SVMRank

    • The correct answer is ranked first

    • All other candidates are ranked second

  • A graph-based method for entity linking, which exploits the graph structure of Wikipedia to perform named-entity disambiguation based on two measures:

    • Out-Degree: the nodes are the named entities present in the support document that also correspond to entities in the KB, plus the text articles of the candidates. There is a directed edge from an article node to a name node when the name is mentioned in the article.

    • In-Degree: the nodes are the name strings of the candidate entities, plus the text articles of the named entities present in the support document that are also entities in the KB. There is a directed edge from an article to a candidate name string when the article of a context named entity contains that name.
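The sketch below illustrates the two degree measures with plain substring matching, and then writes the result in the SVMrank input format (`<target> qid:<qid> <feature>:<value> ... # <comment>`), where a higher target means a better rank. Whether the prototype feeds these counts to SVMRank or uses them as a separate ranker is not stated here; treating them as two features is an assumption made purely for illustration, as are the function names, ids, and article snippets.

```python
def out_degree(candidate_article, context_names):
    """Number of context entity names mentioned in the candidate's article."""
    text = candidate_article.lower()
    return sum(1 for name in set(context_names) if name.lower() in text)


def in_degree(candidate_name, context_articles):
    """Number of context entity articles that mention the candidate's name."""
    name = candidate_name.lower()
    return sum(1 for article in context_articles.values() if name in article.lower())


def svmrank_lines(qid, candidates, correct_id):
    """Format candidates as SVMrank lines: correct answer 2, all others 1."""
    lines = []
    for kb_id, values in candidates:
        target = 2 if kb_id == correct_id else 1
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1))
        lines.append(f"{target} qid:{qid} {feats} # {kb_id}")
    return lines


# Made-up example: the support document mentions Seattle and Olympia.
context_articles = {
    "Seattle": "Seattle is the largest city in Washington (state).",
    "Olympia": "Olympia is the capital of Washington (state).",
}
candidate_names = {"E_state": "Washington (state)", "E_gw": "George Washington"}
candidate_articles = {
    "E_state": "Washington is a state whose capital is Olympia "
               "and whose largest city is Seattle.",
    "E_gw": "George Washington was the first president of the United States.",
}

candidates = [
    (kb_id, [out_degree(candidate_articles[kb_id], context_articles.keys()),
             in_degree(candidate_names[kb_id], context_articles)])
    for kb_id in candidate_names
]
lines = svmrank_lines(1, candidates, correct_id="E_state")
```

Substring matching is the simplest possible notion of "mentioned in the article"; a real system would more likely rely on Wikipedia links or a tokenised index.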

NIL Detection

  • An SVM to distinguish between a correct candidate and NIL (i.e., the entity has no representation in the KB)
  • Features:
    • Score
    • Mean Score
    • Difference to Mean Score
    • Standard Deviation
    • Dixon's Q Test for Outliers
    • Grubbs' Test for Outliers
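The features above can all be derived from the list of ranking scores of one query's candidates. The sketch below assumes that "Score" is the score of the top-ranked candidate and that both outlier statistics are computed for that top score; these readings, and the name `nil_features`, are assumptions rather than details taken from the prototype.

```python
from statistics import mean, stdev


def nil_features(scores):
    """Compute NIL-detection features from the candidates' ranking scores."""
    if not scores:
        raise ValueError("at least one candidate score is required")
    ordered = sorted(scores, reverse=True)
    top = ordered[0]
    mu = mean(ordered)
    sigma = stdev(ordered) if len(ordered) > 1 else 0.0   # sample std. deviation
    value_range = ordered[0] - ordered[-1]
    # Dixon's Q for the largest value: gap to the runner-up over the full range.
    dixon_q = (ordered[0] - ordered[1]) / value_range if value_range > 0 else 0.0
    # Grubbs' statistic for the largest value: distance to the mean in std. devs.
    grubbs_g = (top - mu) / sigma if sigma > 0 else 0.0
    return {
        "score": top,
        "mean_score": mu,
        "diff_to_mean": top - mu,
        "std_dev": sigma,
        "dixon_q": dixon_q,
        "grubbs_g": grubbs_g,
    }


features = nil_features([4.0, 1.0, 1.0, 2.0])
```

The intuition is that a top score standing out clearly from the rest (large Q and G) suggests a confident match, while a flat score distribution hints that no candidate is correct and the query may be NIL. The resulting vector would then be passed to the SVM classifier.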
