DART: Open-Domain Structured Data Record to Text Generation

DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations. Each input is a set of entity-relation triples following a tree-structured ontology, derived from data records in tables and the tree ontology of the table schema, and is annotated with a sentence description that covers all facts in the triple set. The corpus contains 82,191 examples across different domains.

DART is described in more detail, together with baseline results, in this paper.

Data Content and Format

The DART dataset is available in the data/v1.1.1/ directory, with JSON and XML versions of the train/dev/test files.

Each JSON file contains a list of tripleset-annotation pairs of the form:

  {
    "tripleset": [
      [
        "Ben Mauk",
        "High school",
        "Kenton"
      ],
      [
        "Ben Mauk",
        "College",
        "Wake Forest Cincinnati"
      ]
    ],
    "subtree_was_extended": false,
    "annotations": [
      {
        "source": "WikiTableQuestions_lily",
        "text": "Ben Mauk, who attended Kenton High School, attended Wake Forest Cincinnati for college."
      }
    ]
  }
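
The sketch below shows one way to load a JSON split and iterate over these pairs; the file name is illustrative, so substitute whichever train/dev/test file you are using.

  import json

  # Load one JSON split and walk over its tripleset-annotation pairs.
  # The path is an example; point it at the actual train/dev/test file.
  with open("data/v1.1.1/dart-v1.1.1-full-train.json", encoding="utf-8") as f:
      examples = json.load(f)

  for example in examples:
      triples = example["tripleset"]        # list of [subject, predicate, object] triples
      for annotation in example["annotations"]:
          source = annotation["source"]     # e.g. "WikiTableQuestions_lily"
          text = annotation["text"]         # sentence covering all facts in the triple set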

Each XML file contains a list of tripleset-lex pairs of the form:

  <entry category="MISC" eid="Id1" size="2">
    <modifiedtripleset>
      <mtriple>Mars Hill College | JOINED | 1973</mtriple>
      <mtriple>Mars Hill College | LOCATION | Mars Hill, North Carolina</mtriple>
    </modifiedtripleset>
    <lex comment="WikiSQL_decl_sents" lid="Id1">A school from Mars Hill, North Carolina, joined in 1973.</lex>
  </entry>
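
Similarly, the XML splits can be read with the Python standard library; this is a minimal sketch, again with an illustrative file name.

  import xml.etree.ElementTree as ET

  # Parse an XML split and recover (triples, references) for each entry.
  root = ET.parse("data/v1.1.1/dart-v1.1.1-full-test.xml").getroot()

  for entry in root.iter("entry"):
      triples = [
          tuple(part.strip() for part in mtriple.text.split("|"))
          for mtriple in entry.iter("mtriple")
      ]
      references = [lex.text for lex in entry.iter("lex")]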

You can use data/v1.1.1/select_partitions.py to generate datasets containing different partitions of DART. Note that different partitions have different sources of annotation:

  • WikiTableQuestions_lily, WikiSQL_lily ⇒ Instances manually annotated by internal annotators
  • WikiTableQuestions_mturk ⇒ Instances manually annotated by MTurk workers
  • WikiSQL_decl_sents ⇒ Instances automatically annotated by the procedure described in Sec. 2.2 of our paper
  • webnlg, e2e ⇒ Instances obtained by converting existing datasets; these partitions are less open-domain

In addition, we provide four settings for generating the dataset for research purposes:

  • manual: includes all manually annotated instances
  • manual_and_auto: includes both manually and automatically annotated instances, excluding the webnlg and e2e partitions, which are less open-domain
  • full: includes all partitions of DART
  • custom: lets you choose any combination of partitions (a filtering sketch follows this list)
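
If you prefer to build a custom partition yourself rather than running select_partitions.py, the hypothetical sketch below filters a JSON split by the "source" field of its annotations; the source names follow the list above, and the file name is illustrative.

  import json

  # Hypothetical helper: keep only examples whose annotations come from the chosen
  # sources, i.e. a "custom"-style partition built directly from a JSON split.
  MANUAL_SOURCES = {"WikiTableQuestions_lily", "WikiSQL_lily", "WikiTableQuestions_mturk"}

  def filter_by_source(examples, keep_sources):
      filtered = []
      for example in examples:
          kept = [a for a in example["annotations"] if a["source"] in keep_sources]
          if kept:
              filtered.append({**example, "annotations": kept})
      return filtered

  with open("data/v1.1.1/dart-v1.1.1-full-train.json", encoding="utf-8") as f:
      manual_only = filter_by_source(json.load(f), MANUAL_SOURCES)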

Models

We also provide the implementations used to produce the results in our paper. Please refer to model/ for more information.

Leaderboard

We maintain a leaderboard on our test set.

| Model | BLEU | METEOR | TER | MoverScore | BERTScore | BLEURT | PARENT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Control Prefixes (T5-large) (Clive et al., 2021) | 51.95 | 0.41 | 0.43 | - | 0.95 | - | - |
| T5-large (Raffel et al., 2020) | 50.66 | 0.40 | 0.43 | 0.54 | 0.95 | 0.44 | 0.58 |
| BART-large (Lewis et al., 2020) | 48.56 | 0.39 | 0.45 | 0.52 | 0.95 | 0.41 | 0.57 |
| Seq2Seq-Att (MELBOURNE) | 29.66 | 0.27 | 0.63 | 0.31 | 0.90 | -0.13 | 0.35 |
| End-to-End Transformer (Castro Ferreira et al., 2019) | 27.24 | 0.25 | 0.65 | 0.25 | 0.89 | -0.29 | 0.28 |
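
The leaderboard BLEU is corpus-level with multiple references per example. The sketch below is not the official scoring pipeline (see evaluation/ for the scripts used for the paper); it only illustrates how multi-reference corpus BLEU can be computed with sacrebleu, assuming your hypotheses and per-example reference lists are already loaded.

  import sacrebleu

  # Illustrative multi-reference corpus BLEU; not the official DART scoring script.
  def corpus_bleu(hypotheses, references_per_example):
      max_refs = max(len(refs) for refs in references_per_example)
      # Transpose into sacrebleu's expected shape (one stream per reference slot),
      # padding missing references with None (supported by recent sacrebleu versions).
      ref_streams = [
          [refs[i] if i < len(refs) else None for refs in references_per_example]
          for i in range(max_refs)
      ]
      return sacrebleu.corpus_bleu(hypotheses, ref_streams).score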

Citation

@inproceedings{nan-etal-2021-dart,
    title = "{DART}: Open-Domain Structured Data Record to Text Generation",
    author = "Nan, Linyong  and
      Radev, Dragomir  and
      Zhang, Rui  and
      Rau, Amrit  and
      Sivaprasad, Abhinand  and
      Hsieh, Chiachun  and
      Tang, Xiangru  and
      Vyas, Aadit  and
      Verma, Neha  and
      Krishna, Pranav  and
      Liu, Yangxiaokang  and
      Irwanto, Nadia  and
      Pan, Jessica  and
      Rahman, Faiaz  and
      Zaidi, Ahmad  and
      Mutuma, Mutethia  and
      Tarabar, Yasin  and
      Gupta, Ankit  and
      Yu, Tao  and
      Tan, Yi Chern  and
      Lin, Xi Victoria  and
      Xiong, Caiming  and
      Socher, Richard  and
      Rajani, Nazneen Fatema",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.37",
    doi = "10.18653/v1/2021.naacl-main.37",
    pages = "432--447",
    abstract = "We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.",
}


Issues

Missing annotations in test set

Hi! 😃

Great data augmentation you've done here!

I have noticed missing lexicalizations/annotations for some entries in your test set (for example, eid=Id1392 in your dart-v1.1.1-full-test.xml file).

Do you plan to add those?

Thanks!

Number of references used for training and testing

I noticed that generate_input_dart.py only uses 3 references for evaluation, yet some examples have many more references. I was wondering if you could provide more details about the results in the paper; I can't seem to replicate them. It would also be very helpful if you could share the T5 and BART models fine-tuned on DART.

Update Evaluation used for V.1.1.1

The references in /evaluation/dart_reference are not for the current version. Can you replace them with the new references and share the tokenization script applied to the predictions?

I am getting very different BLEU scores depending on the tokenization and on how many references I use, as there are up to ~30 references for a few examples.

I would like to directly compare against the README leaderboard.

About the size of the DART dataset and its performance

Recently, I used GPT to do generation on the DART dataset. However, I found that the test set may be different from the one used in other works. In fact, I can only get 5,097 samples for testing, while the GEM website says their test set has 12,552. The data provided by (Li et al., 2021) (https://github.com/XiangLi1999/PrefixTuning) also has 12,552 samples, but they do not provide gold references.

Using the official evaluation scripts and test set, I obtain about 37-38 BLEU, which is much lower than the results (46-47 BLEU) reported by (Li et al., 2021) and other works (like the leaderboard on GitHub: https://github.com/Yale-LILY/dart). So I am confused about which one is correct.

Could you please answer these questions if possible? I would appreciate it.

Reference

  1. Li, X. L. and Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv preprint arXiv:2101.00190, 2021.
