dataset-factoid-webquestions's Introduction

WebQuestions QA Benchmarking Dataset

WebQuestions (http://nlp.stanford.edu/software/sempre/ - Berant et al., 2013, CC-BY) is a popular dataset for benchmarking QA engines, especially ones that work on structured knowledge bases.

This is an effort to make the dataset more organized and easier to use. We assign a unique ID to each question and distribute extra annotations for each question, e.g. pertaining to the question text and to Freebase. We also provide several topic-based splits.

This is a development version of the dataset. If you use it, please cite the Git repository and the date + shortid of the last commit. There are no stability guarantees yet regarding either the data format or the actual set of questions. This dataset is distributed under the terms of the CC-BY 4.0 licence.

  • main/ has the dataset splits as distributed.
  • d-dump/ has question dumps from YodaQA.
  • d-freebase/ has mappings from question to single Freebase key, as distributed with the original WebQuestions dataset.
  • d-freebase-mids/ has Freebase mids for each concept in each question, based on YodaQA entity linking results from d-dump.
  • d-freebase-rp/ has extra custom-computed Freebase relation paths.
  • d-freebase-brp/ has extra custom-computed branched Freebase relation paths (superset of d-freebase-rp).
  • d-entities/ has entity occurrences detected in question texts (generated by Yao et al. for Jacana).
  • t-movies/ has question sub-splits pertaining to the movies topic.

Some of the subsplits (d-dump, d-freebase-*) are autogenerated using a question answering system (YodaQA). To regenerate these, you can run scripts/dump-refresh.sh.

Splits

The original WebQuestions dataset has train (3778 q) and test (2032 q) splits. We keep all the questions within their respective splits, but divide train into a few further sub-splits for the benefit of machine learning methods:

  • devtest (189 q): A set of questions to use for development but not to train models on. You may want to use these to decide which features to add, then check if the features generalize even for questions the model was not trained on.

  • val (755 q): A validation set - questions you use not for training the model but for measuring its performance while tuning it (set of features, hyperparameters...). By using this set instead of the test set, you make sure that you are not indirectly overfitting to the test set. Ideally, you would completely withhold the test set, reporting only the final performance of your model on it, and use the val set for day-to-day development.

  • trainmodel (2834 q): A set of questions to use for model training.

Using these further splits is optional. You may prefer the complete train split if, for example, you are testing just a trivial model that requires no tuning, or simply want to be very strict about comparability with past research. In either case, make sure to report whether you used the sub-splits or the complete train split for training. It is also okay to use the combination of devtest and val as a validation set.
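As a quick sanity check on the numbers above, the following sketch confirms that the three sub-splits add up to the original train split and prints the share of each. The variable names are illustrative only and are not taken from the repository.

```python
# Split sizes as stated in this README.
TRAIN_SIZE = 3778
TEST_SIZE = 2032
SUBSPLIT_SIZES = {"devtest": 189, "val": 755, "trainmodel": 2834}

# The sub-splits partition the original train split.
subsplit_total = sum(SUBSPLIT_SIZES.values())
assert subsplit_total == TRAIN_SIZE

# Share of the train split taken by each sub-split.
shares = {name: size / TRAIN_SIZE for name, size in SUBSPLIT_SIZES.items()}
for name, share in shares.items():
    print("%-10s %4d q  %5.1f%%" % (name, SUBSPLIT_SIZES[name], 100 * share))
```

The sub-splits come out at roughly 5% (devtest), 20% (val) and 75% (trainmodel) of train.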

To generate a .json file with the full train split, run scripts/mktrain.py.
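For illustration, recombining the sub-splits could look roughly like the sketch below. This is not the code of scripts/mktrain.py; the function name merge_splits is hypothetical, and it assumes only that each sub-split is a list of question objects carrying a "qId" attribute. Because the numeric part of the identifiers is zero-padded, sorting by "qId" as a string also sorts by question number.

```python
def merge_splits(*splits):
    """Merge several lists of question objects into one list ordered by qId.

    Hypothetical helper, not part of the repository. Raises ValueError if
    the same qId appears in more than one place.
    """
    merged = {}
    for questions in splits:
        for question in questions:
            qid = question["qId"]
            if qid in merged:
                raise ValueError("duplicate qId: %s" % qid)
            merged[qid] = question
    # Zero-padded ids sort lexicographically in numeric order.
    return [merged[qid] for qid in sorted(merged)]


# Made-up placeholder records, only to show the call shape.
devtest = [{"qId": "wqr000002", "qText": "placeholder b?", "answers": ["b"]}]
val = [{"qId": "wqr000003", "qText": "placeholder c?", "answers": ["c"]}]
trainmodel = [{"qId": "wqr000001", "qText": "placeholder a?", "answers": ["a"]}]

train = merge_splits(devtest, val, trainmodel)
```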

Data Model

The questions have identifiers in the form of "wqr%06d" (train) or "wqs%06d" (test), where %06d is a zero-padded six-digit number assigned based on the original dataset order. The master JSON files consist of an array with a single object per question, where each object has a string attribute "qId", a string attribute "qText" and an attribute "answers" which is an array of strings.
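A minimal sketch of this data model follows. The helper names make_qid and is_valid_question are hypothetical, the sample record is a made-up placeholder rather than an actual dataset entry, and whether numbering starts at 0 or 1 is not specified here, so the index used is arbitrary.

```python
import json

# Identifier prefixes as described above.
QID_PREFIX = {"train": "wqr", "test": "wqs"}


def make_qid(split, index):
    """Format a question identifier: prefix plus zero-padded six-digit number."""
    return "%s%06d" % (QID_PREFIX[split], index)


def is_valid_question(obj):
    """Check that obj has string qId, string qText and a list of string answers."""
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("qId"), str)
        and isinstance(obj.get("qText"), str)
        and isinstance(obj.get("answers"), list)
        and all(isinstance(answer, str) for answer in obj["answers"])
    )


# Placeholder content in the shape of a master JSON file (not real data).
sample_json = """
[
  {"qId": "wqr000001", "qText": "placeholder question?", "answers": ["placeholder answer"]}
]
"""
sample = json.loads(sample_json)
all_valid = all(is_valid_question(question) for question in sample)
```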

Various extra data are present in the d-*/ directories (see the respective READMEs). To build a single file per split with full data for each question, run scripts/fulldata.py.
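Conceptually, building such a file amounts to joining the extra data onto the master questions by "qId". The sketch below shows that idea only; it is not the code of scripts/fulldata.py. The function name attach_annotations, the keyword name "extra" and the field "note" are hypothetical, and it assumes each annotation record carries a "qId" attribute, which should be checked against the respective d-*/ READMEs.

```python
def attach_annotations(questions, **annotation_sets):
    """Return copies of the questions with annotation records attached by qId.

    Hypothetical helper. Each keyword argument maps a name to a list of
    records that are assumed to have a "qId" attribute; the remaining
    attributes of a record are stored under that name in the question.
    """
    by_id = {question["qId"]: dict(question) for question in questions}
    for name, records in annotation_sets.items():
        for record in records:
            target = by_id.get(record["qId"])
            if target is None:
                continue  # annotation for a question outside this split
            target[name] = {k: v for k, v in record.items() if k != "qId"}
    # Preserve the order of the master question list.
    return [by_id[question["qId"]] for question in questions]


# Made-up placeholder data, only to show the call shape.
questions = [
    {"qId": "wqr000001", "qText": "placeholder a?", "answers": ["a"]},
    {"qId": "wqr000002", "qText": "placeholder b?", "answers": ["b"]},
]
extra = [{"qId": "wqr000002", "note": "placeholder annotation"}]

full = attach_annotations(questions, extra=extra)
```

The input lists are left unmodified, since each question is shallow-copied before annotations are attached.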

To build a YodaQA-compatible dataset in the TSV format, run scripts/json2tsv.pl.

Contributors

pasky, pichljan
