AmazonQA: A Review-Based Question Answering Task

The AmazonQA dataset is a large review-based Question Answering dataset (paper).

This repository comprises:

  • instructions to download and work with the dataset
  • implementations of preprocessing pipelines to re-generate the data for different configurations
  • analyses of the dataset
  • implementations of baseline models mentioned in the paper

Download Instructions

The dataset can be downloaded from the following links:

Format

The dataset is in .jsonl format: each line in the file is a JSON string that corresponds to a question, the existing answers to that question, and the extracted review snippets relevant to the question.

Each json string has many fields. Here are the fields that the QA training pipeline uses:

  • questionText: String. The question.
  • questionType: String. Either "yesno" for a boolean question, or "descriptive" for a non-boolean question.
  • review_snippets: List of strings. Extracted review snippets relevant to the question (at most ten).
  • answers: List of dicts, one for each answer. Each dict has the following fields.
    • answerText: String. The text for the answer.
    • answerType: String. Type of the answer.
    • helpful: List of two integers. The first integer is the number of users who found the answer helpful. The second integer is the total number of responses.
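
As a minimal sketch of how these fields might be read, the snippet below parses .jsonl lines and picks the answer with the highest helpful-vote ratio for each question. The helper names (`iter_questions`, `best_answer`) and the sample record, including its `answerType` values, are illustrative assumptions and are not taken from the repository.

```python
import io
import json


def iter_questions(lines):
    """Yield one dict per non-empty .jsonl line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def best_answer(question):
    """Return the answer text with the highest helpful-vote ratio, or None.

    `helpful` is [helpful_votes, total_votes]; answers without any
    votes are scored 0.0 so they rank last.
    """
    def ratio(answer):
        helpful_votes, total_votes = answer["helpful"]
        return helpful_votes / total_votes if total_votes else 0.0

    answers = question.get("answers", [])
    if not answers:
        return None
    return max(answers, key=ratio)["answerText"]


# Made-up record for illustration only; a real file would be opened with open(path).
SAMPLE = json.dumps({
    "questionText": "Does this fit a 15 inch laptop?",
    "questionType": "yesno",
    "review_snippets": ["Fits my 15 inch laptop well."],
    "answers": [
        {"answerText": "Not sure.", "answerType": "?", "helpful": [0, 0]},
        {"answerText": "Yes, it fits.", "answerType": "Y", "helpful": [3, 4]},
    ],
})

for question in iter_questions(io.StringIO(SAMPLE + "\n")):
    print(question["questionText"], "->", best_answer(question))
```

Iterating line by line keeps memory use flat, which matters for files with hundreds of thousands of questions.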

Here are some other fields that we use for evaluation and analysis:

  • asin: String. Unique product ID for the product the question pertains to.
  • qid: Integer. Unique question id for the question (in the entire dataset).
  • category: String. Product category.
  • top_review_wilson: String. The review with the highest Wilson score.
  • top_review_helpful: String. The review voted as most helpful by the users.
  • is_answerable: Boolean. Output of the answerability classifier indicating whether the question is answerable using the review snippets.
  • top_sentences_IR: List of strings. A list of top sentences (at most 10) based on IR score with the question.
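
This section does not say which variant of the Wilson score is used to select top_review_wilson. For reference, a common choice is the lower bound of the Wilson score interval on the helpful-vote proportion. The sketch below assumes that variant, a 95% confidence level (z = 1.96), and helpful/total vote counts like those in the helpful field; `wilson_lower_bound` is a hypothetical name.

```python
import math


def wilson_lower_bound(helpful_votes, total_votes, z=1.96):
    """Lower bound of the Wilson score interval for a vote proportion.

    Assumption: this is one common way to rank items by votes; the
    dataset's own scoring may differ in detail.
    """
    if total_votes == 0:
        return 0.0
    p = helpful_votes / total_votes
    z2 = z * z
    centre = p + z2 / (2 * total_votes)
    margin = z * math.sqrt(p * (1 - p) / total_votes + z2 / (4 * total_votes ** 2))
    # Clamp at 0.0 to absorb tiny negative values from floating-point rounding.
    return max(0.0, (centre - margin) / (1 + z2 / total_votes))


# Same 90% helpful ratio, but more votes give a higher (more confident) score.
print(wilson_lower_bound(9, 10))
print(wilson_lower_bound(90, 100))
```

Unlike the raw ratio, this score penalizes reviews with very few votes, so a review with 90 of 100 helpful votes ranks above one with 9 of 10.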

Dataset Statistics

Our dataset consists of 923k questions, 3.6M answers and 14M reviews across 156k products. We build on the well-known Amazon dataset -

We also collect additional annotations, marking each question as either answerable or unanswerable based on the available reviews.

Data Processing

Scripts

The src/prepro/ folder contains all the scripts for generating the raw and the various processed datasets.

Raw Products Dataset

The script generates the raw train/val/test product splits by combining the well-known Amazon reviews and questions datasets for all the categories.

train val

Processed Dataset

The script creates question-answer pairs with query-relevant review snippets and an is_answerable annotation produced by a trained classifier. More details on this step are given in Section 3.1 (Data Processing) of the paper.

train val

Auxiliary Datasets

We also provide scripts to convert our dataset to other question answering dataset formats, such as SQuAD and MS MARCO.

Span-based

The script converts our dataset to SQuAD format by extracting snippets using different span heuristics. More details on this step are given in Section 5.2 (Span-based QA model) of the paper.
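
The span heuristics themselves are described in the paper rather than here, so the sketch below only illustrates the target layout. It uses a stand-in heuristic (the snippet sentence sharing the most tokens with the answer), which is an assumption and not necessarily one of the heuristics the script implements. The names `best_span` and `to_squad_example` are hypothetical, and using asin as the SQuAD title is likewise an illustrative choice.

```python
import re


def _tokens(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def best_span(context, answer):
    """Stand-in heuristic: the sentence in `context` with the most tokens in common with `answer`.

    Returns (span_text, start_offset), or None if no sentence overlaps.
    """
    answer_tokens = _tokens(answer)
    best, best_score = None, 0
    # Sentences are approximated as runs of text ending in '.', '!' or '?'.
    for match in re.finditer(r"[^.!?]+[.!?]?", context):
        sentence = match.group().strip()
        if not sentence:
            continue
        score = len(answer_tokens & _tokens(sentence))
        if score > best_score:
            start = context.index(sentence, match.start())
            best, best_score = (sentence, start), score
    return best


def to_squad_example(question):
    """Build one SQuAD-style article entry from one question record."""
    context = " ".join(question["review_snippets"])
    spans = []
    for answer in question["answers"]:
        span = best_span(context, answer["answerText"])
        if span is not None:
            text, start = span
            spans.append({"text": text, "answer_start": start})
    return {
        "title": question["asin"],
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": str(question["qid"]),
                "question": question["questionText"],
                "answers": spans,
            }],
        }],
    }
```

In SQuAD, `answer_start` is a character offset into `context`, which is why the span is located in the joined snippet text rather than in an individual snippet.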

train val

Generative

The script converts our dataset to MS MARCO format.
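
As a rough sketch of what such a conversion could look like, the function below maps one question record to an MS MARCO-style record. The output field names (`query_id`, `query`, `query_type`, `passages`, `answers`, `is_selected`, `passage_text`, `url`) are assumed from the publicly documented MS MARCO QnA v1.1 layout and may not match exactly what the script emits. `to_msmarco_record` is a hypothetical name. Setting `is_selected` to 0 and leaving `url` empty are placeholders, since the fields described above carry no passage-selection labels or URLs.

```python
import json


def to_msmarco_record(question):
    """Map one question record to an MS MARCO-style dict (assumed layout)."""
    passages = [
        # Placeholder values: no selection labels or URLs are available here.
        {"is_selected": 0, "passage_text": snippet, "url": ""}
        for snippet in question["review_snippets"]
    ]
    return {
        "query_id": question["qid"],
        "query": question["questionText"],
        "query_type": question["questionType"],
        "passages": passages,
        "answers": [answer["answerText"] for answer in question["answers"]],
    }


def to_msmarco_lines(questions):
    """Serialize records one per line, mirroring the .jsonl input format."""
    return "\n".join(json.dumps(to_msmarco_record(q)) for q in questions)
```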

train val

Contributors

nitish-kulkarni, rchanda, mgupta1410, anirudharc
