[toc]

Adversarial In-Context Learning

Model

Algorithm

Super-NaturalInstructions

A collection of 1,616 tasks and their natural-language definitions/instructions.

Tasks

Tasks follow this schema:

Example: Task 582

{
    "Definition": [
        "In this task, You are given an open-domain question that can be answered based on factual information. Your task is to provide \\*short\\* answer (in a few words only) for the given question. The short answer can be one or more entities or it can also be boolean \\*yes\\* or \\*no\\*."
    ],
    "Positive Examples": [
        {
            "input": "when are hops added to the brewing process?",
            "output": "The boiling process",
            "explanation": "The answer is correct because, at the end of the boil, solid particles in the hopped wort are separated."
        },
        {
            "input": "who played will on as the world turns?",
            "output": "Jesse Soffer",
            "explanation": "The answer is correct. William \"Will\" Harold Ryan Munson is a fictional character on the CBS soap opera As the World Turns. He was portrayed by Jesse Soffer on a recurring basis from September 2004 to March 2005."
        },
        {
            "input": "who won the election for mayor of cleveland?",
            "output": "Incumbent Democratic Mayor Frank G . Jackson",
            "explanation": "Incumbent Democratic Mayor Frank G . Jackson won reelection to a fourth term."
        }
    ],
    "Negative Examples": [
        {
            "input": "where do dust storms occur in the US?",
            "output": " ",
            "explanation": "It generates no answer when it's supposed to give an answer."
        },
        {
            "input": "when did the watts riot start and end?",
            "output": "watts riot",
            "explanation": "The question is about the time watts riot started and ended, so the answer should be: August 11 to 16, 1965."
        },
        {
            "input": "when did kendrick lamars first album come out?",
            "output": "Ronald Reagan Era",
            "explanation": "It supposed to give the answer on what time Kendrick Lamar released his album, not the name of the album."
        }
    ],
    "Instances": [
        {
            "id": "task582-bdd71027a2ec47e09f636e6609d5bdaf",
            "input": "where did they film hot tub time machine",
            "output": [
                "Fernie Alpine Resort"
            ]
        },
        {
            "id": "task582-60cb1ae6f8304abaad27c5e897698b78",
            "input": "who has the right of way in international waters",
            "output": [
                "Neither vessel"
            ]
        },
        ...
    ]
}
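To make the schema concrete, here is a minimal Python sketch of how a task record could be turned into an in-context prompt. The `build_prompt` helper and its "Definition / Input / Output" template are assumptions made for illustration, not the prompt format used by the dataset authors, and the inline task is an abbreviated stand-in for a real task file.

```python
import json

# Abbreviated stand-in for a task file that follows the schema above.
TASK_JSON = """
{
    "Definition": ["Answer the question briefly."],
    "Positive Examples": [
        {"input": "when are hops added to the brewing process?",
         "output": "The boiling process",
         "explanation": "..."}
    ],
    "Negative Examples": [],
    "Instances": [
        {"id": "task582-example",
         "input": "where did they film hot tub time machine",
         "output": ["Fernie Alpine Resort"]}
    ]
}
"""

def build_prompt(task: dict, instance_input: str, num_pos: int = 2) -> str:
    """Join the definition, a few positive examples, and the query.

    The template is hypothetical; only the field names come from the schema.
    """
    parts = ["Definition: " + " ".join(task["Definition"]), ""]
    for ex in task["Positive Examples"][:num_pos]:
        parts += ["Input: " + ex["input"], "Output: " + ex["output"], ""]
    parts += ["Input: " + instance_input, "Output:"]
    return "\n".join(parts)

task = json.loads(TASK_JSON)
instance = task["Instances"][0]
prompt = build_prompt(task, instance["input"], num_pos=1)
print(prompt)
```

Note that `output` is a plain string in the examples but a list of acceptable references in `Instances`, so evaluation code has to handle several references per instance.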

Task Types

The dataset covers 76 task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition.

Metrics

They adopt ROUGE-L as the evaluation metric. The score lies in [0, 1]; the results below report it scaled to 0-100. ROUGE-L is based on the longest common subsequence (LCS) between the model output and the reference, i.e. the longest sequence of words (in the same order, but not necessarily consecutive) that both share.
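The LCS idea can be sketched in a few lines of Python. This is a simplified illustration rather than the official scorer: it assumes lowercased whitespace tokenization and reports the F-measure of LCS precision and recall, whereas the real evaluation script may normalize text differently and takes the best score over multiple references.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure in [0, 1] under simple whitespace tokenization."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)  # share of predicted tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)

score = rouge_l("Fernie Alpine Resort", "Fernie Resort")
print(round(score, 4))  # LCS = 2, precision = 2/3, recall = 1
```

For short-answer tasks, ROUGE-L gives partial credit where exact match gives none, which is why the two metrics can diverge sharply on categories such as question rewriting.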

Experiment Result

Baseline

======== Overall Metrics ========
exact_match: 52.6
rougeL: 65.5513
======== Metrics per Category ========
exact_match_for_question_rewriting: 2.0
rougeL_for_question_rewriting: 64.2207
exact_match_for_coreference_resolution: 85.0
rougeL_for_coreference_resolution: 85.0
exact_match_for_textual_entailment: 65.0
rougeL_for_textual_entailment: 65.0
exact_match_for_cause_effect_classification: 46.0
rougeL_for_cause_effect_classification: 46.5357
exact_match_for_word_analogy: 65.0
rougeL_for_word_analogy: 67.0