Ryo Kamoi, Tanya Goyal, Juan Diego Rodriguez, Greg Durrett
This repository contains the dataset for "WiCE: Real-World Entailment for Claims in Wikipedia".
The data directory includes the WiCE dataset.
data/entailment_retrieval includes the WiCE dataset for the entailment and retrieval tasks. data/entailment_retrieval/claim includes data with the original claims, and data/entailment_retrieval/subclaims includes data with the decomposed claims (fine-grained annotation using Claim-Split).
Each sub-directory includes jsonl files for train, dev, and test sets. Here is an example of the data in the jsonl files:
{
"label": "partially_supported",
"supporting_sentences": [[5, 15], [15, 17]],
"claim": "Arnold is currently the publisher and editorial director of Media Play News, one of five Hollywood trades and the only one dedicated to the home entertainment sector.",
"evidence": [list of evidence sentences],
"meta": {"id": "dev02986", "claim_title": "Roger Hedgecock", "claim_section": "Other endeavors.", "claim_context": [paragraph]}
}
label: Entailment label in {supported, partially_supported, not_supported}
supporting_sentences: List of sets of indices of supporting sentences. All provided sets of supporting sentences are valid (in the above example, both [5, 15] and [15, 17] are annotated as correct sets of supporting sentences that contain the same information).
claim: A sentence from Wikipedia
evidence: A list of sentences in the cited website
meta:
  claim_title: Title of the Wikipedia page that includes claim
  claim_section: Section that includes claim
  claim_context: Sentences just before claim
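The fields above can be read with a few lines of Python. The following is a minimal sketch, not part of the official repository: the helper names read_jsonl and resolve_supporting_sets are hypothetical, the toy record is made up, and the code assumes that the indices in supporting_sentences are 0-based positions in evidence.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a jsonl file: one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def resolve_supporting_sets(record):
    """Map each set of supporting-sentence indices to the evidence text.

    Assumption: indices are 0-based positions in record["evidence"].
    """
    evidence = record["evidence"]
    return [[evidence[i] for i in indices]
            for indices in record["supporting_sentences"]]


# Toy record in the documented shape (values are made up, not from WiCE).
toy = {
    "label": "partially_supported",
    "supporting_sentences": [[0, 2], [1, 2]],
    "claim": "A toy claim.",
    "evidence": ["sentence a", "sentence b", "sentence c"],
    "meta": {"id": "toy0", "claim_title": "Toy", "claim_section": "Toy.",
             "claim_context": []},
}

# Each printed list is one valid set of supporting sentences.
for sentences in resolve_supporting_sets(toy):
    print(sentences)
```

With the real data, read_jsonl would be called on one of the jsonl files in data/entailment_retrieval/claim or data/entailment_retrieval/subclaims; the exact file names are not listed here, so check the directory contents first.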
data/non_supported_tokens includes the WiCE dataset for the non-supported token detection task. We only provide annotations for sub-claims that are annotated as partially_supported. We filtered out data points with low inter-annotator agreement (please refer to the paper for details).
{
"claim": "Irene Hervey appeared in over fifty films and numerous television series.",
"claim_tokens": ["Irene", "Hervey", "appeared", "in", "over", "fifty", "films", "and", "numerous", "television", "series", "."],
"non_supported_spans": [false, false, false, false, true, true, false, false, false, false, false, false],
"evidence": [list of evidence sentences],
"meta": {"id": "test00561-1", "claim_title": "Irene Hervey", "claim_section": "Abstract.", "claim_context": " Irene Hervey was an American film, stage, and television actress."}
}
claim_tokens: List of tokens in the claim
non_supported_spans: List of booleans corresponding to claim_tokens (true marks a non-supported token)
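Because non_supported_spans is aligned with claim_tokens position by position, the non-supported tokens can be recovered by zipping the two lists. The sketch below uses the example record shown above (with evidence and meta omitted); the function names non_supported_tokens and flagged_spans are hypothetical helpers, not part of the repository.

```python
def non_supported_tokens(record):
    """Return the tokens whose flag in non_supported_spans is true."""
    tokens = record["claim_tokens"]
    flags = record["non_supported_spans"]
    if len(tokens) != len(flags):
        raise ValueError("claim_tokens and non_supported_spans differ in length")
    return [token for token, flag in zip(tokens, flags) if flag]


def flagged_spans(flags):
    """Group consecutive true flags into (start, end) pairs, end exclusive."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a new span begins here
        elif not flag and start is not None:
            spans.append((start, i))       # the current span ended at i - 1
            start = None
    if start is not None:                  # span running to the last token
        spans.append((start, len(flags)))
    return spans


example = {
    "claim_tokens": ["Irene", "Hervey", "appeared", "in", "over", "fifty",
                     "films", "and", "numerous", "television", "series", "."],
    "non_supported_spans": [False, False, False, False, True, True,
                            False, False, False, False, False, False],
}

print(non_supported_tokens(example))                  # ['over', 'fifty']
print(flagged_spans(example["non_supported_spans"]))  # [(4, 6)]
```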
The claim_split directory includes prompts for Claim-Split, a method to decompose claims by using GPT-3. We use different prompts for different datasets in the experiments in this work, so we provide prompts for WiCE, VitaminC, PAWS, and FRANK (XSum).