TAGME reproducibility

This repository contains resources developed within the following paper:

F. Hasibi, K. Balog, and S. E. Bratsberg. “On the reproducibility of the TAGME Entity Linking System.”
In Proceedings of the 38th European Conference on Information Retrieval (ECIR ’16), March 2016.

This study aims to reproduce the results presented in the TAGME paper [1].

We received invaluable comments from the TAGME authors about their system, and we have made these notes available here. Because these details cannot be found in the original paper, they may inform future efforts to re-implement the TAGME system.

This repository is structured as follows:

  • nordlys/: Code required for running entity linkers.
  • scripts/: Evaluation scripts.
  • lib/: Contains libraries.
  • run-scripts.sh: A single script that runs all the scripts needed to obtain the results of the paper.
  • authors_comments.md: Comments from the TAGME authors and notes about our experiments.

Other resources involved in this project are data, qrels, and runs, which are described below.

Note: Before running the code (run-scripts.sh), please read the setup file and build all the required resources.

Data

The following data files can be downloaded from here:

  • Wiki-disamb30 and Wiki-annot30: The original datasets are published here. We complement the snippets with numerical IDs, as IDs are not contained in the original datasets.
  • ERD-dev: The dataset is originally published by the ERD Challenge; we use it in our generalizability experiments. The files related to this dataset are prefixed with Trec_beta.
  • Y-ERD: This dataset is originally published in [2] and is available here. The dataset is used in our generalizability experiments.
  • Freebase snapshot: A snapshot of Freebase containing only proper noun entities (e.g., people and locations) is made available by the ERD challenge and is used for filtering entities in the generalizability experiments.

Qrels

The qrel files can be downloaded from here. All qrels are tab-delimited and their format is as follows:

  • Wiki-disamb30 and Wiki-annot30: The columns represent: snippet ID, confidence score, Wikipedia URI, and Wikipedia page ID. The last column is not considered in the evaluation scripts.
  • ERD-dev and Y-ERD: The columns represent: query ID, confidence score (always 1), and Wikipedia URI. The entities after the second column represent an interpretation set (entity set) of the query. (If a query has multiple interpretations, there are multiple lines with that query ID.)
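
The two qrel layouts described above can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the function names are hypothetical, and the sample IDs and URIs in the usage lines are placeholders rather than real dataset entries. It assumes only what is stated above, namely tab-delimited lines with the listed column order.

```python
from collections import defaultdict


def parse_snippet_qrels(lines):
    """Wiki-disamb30 / Wiki-annot30 layout (hypothetical helper).

    Columns: snippet ID, confidence score, Wikipedia URI, Wikipedia page ID.
    The page ID column is ignored, mirroring the evaluation scripts.
    Returns {snippet_id: set of Wikipedia URIs}.
    """
    qrels = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cols = line.split("\t")
        qrels[cols[0]].add(cols[2])
    return dict(qrels)


def parse_query_qrels(lines):
    """ERD-dev / Y-ERD layout (hypothetical helper).

    Columns: query ID, confidence score (always 1), then the entities of
    one interpretation set. A query with several interpretations appears
    on several lines, so each query maps to a list of entity sets.
    """
    qrels = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cols = line.split("\t")
        qrels[cols[0]].append(frozenset(cols[2:]))
    return dict(qrels)


# Placeholder data for illustration only (not taken from the datasets).
snippet_lines = ["s1\t1\tURI_A\t100\n", "s1\t1\tURI_B\t200\n"]
query_lines = ["q1\t1\tURI_A\tURI_B\n", "q1\t1\tURI_C\n"]

snippet_qrels = parse_snippet_qrels(snippet_lines)
query_qrels = parse_query_qrels(query_lines)
```

Keeping each interpretation as its own set preserves the distinction between alternative readings of a query, which a flat set of entities per query would lose.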

Runs

The run files can be downloaded from here, and are categorized into two groups: reproducibility and generalizability.

  • Reproducibility: The naming convention for these files is XX_YY.txt, where XX represents the dataset and YY is the name of the method. For each file, only the first 4 columns are considered for the evaluation, which are: snippet ID, confidence score, Wikipedia URI, and mention.
  • Generalizability: These files are named XX_YY_ZZ.elq, where XX is the dataset, YY is the name of the method, and ZZ is the entity linking threshold used for evaluation. The format of these files is similar to the corresponding qrel files.
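
The naming conventions above can be decoded programmatically when iterating over the run files. The sketch below is illustrative and not part of the repository; `parse_run_name` is a hypothetical helper. It splits from the right on the assumption that method names and thresholds contain no underscores, since a dataset prefix such as Trec_beta does contain one. The threshold is kept as a string because its exact formatting is not specified here.

```python
import os


def parse_run_name(filename):
    """Split a run file name into its parts (hypothetical helper).

    XX_YY.txt    -> reproducibility run: dataset XX, method YY
    XX_YY_ZZ.elq -> generalizability run: dataset XX, method YY, threshold ZZ

    Assumption: YY and ZZ contain no underscores, so splitting from the
    right keeps dataset prefixes such as "Trec_beta" intact.
    """
    base, ext = os.path.splitext(os.path.basename(filename))
    if ext == ".txt":
        dataset, method = base.rsplit("_", 1)
        return {"group": "reproducibility", "dataset": dataset, "method": method}
    if ext == ".elq":
        dataset, method, threshold = base.rsplit("_", 2)
        return {
            "group": "generalizability",
            "dataset": dataset,
            "method": method,
            "threshold": threshold,  # kept as a string; format not specified
        }
    raise ValueError("unrecognized run file extension: %r" % ext)


# The literal XX/YY/ZZ placeholders from the naming convention are used here.
repro = parse_run_name("runs/XX_YY.txt")
general = parse_run_name("Trec_beta_YY_ZZ.elq")
```

If a method name in the actual run files does contain an underscore, the right-hand split would misassign the parts, so the output is worth checking against the downloaded file list before relying on it.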

Contact

If you have any questions, feel free to contact Faegheh Hasibi at [email protected].

[1] P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM '10, pages 1625–1628, 2010.
[2] F. Hasibi, K. Balog, and S. E. Bratsberg. Entity Linking in Queries: Tasks and Evaluation. In Proceedings of ICTIR ’15, pages 171–180, 2015.
