Barch: corpus of bar chart summaries in English 📊

This repository contains the Barch corpus of human-written bar chart summaries.

We designed bar charts for a selection of 18 different topics. Each chart is associated with one of four main messages, signaled in the chart title. For each chart, we collected around 20 summaries via crowdsourcing, asking participants to describe the chart as if they were presenting it to an audience. The summary text is labeled (aligned) with the chart data, covering both basic information (axis labels, bar names, and bar heights) and analytical information inferred by humans (relations between bars, height approximations).

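| no. topics | no. charts | no. summaries | no. tokens | no. sentences |
|------------|------------|---------------|------------|---------------|
| 18         | 47         | 1,063         | 57,420     | 3,356         |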
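
As a quick sanity check on these figures, the corpus-level averages can be derived from the counts alone. The sketch below is illustrative only: the dictionary keys and the `per` helper are names chosen here, not part of the repository.

```python
# Corpus counts as reported in the table above.
counts = {
    "topics": 18,
    "charts": 47,
    "summaries": 1063,
    "tokens": 57420,
    "sentences": 3356,
}


def per(numerator, denominator):
    """Average number of `numerator` items per `denominator` item."""
    return counts[numerator] / counts[denominator]


if __name__ == "__main__":
    print(f"summaries per chart:    {per('summaries', 'charts'):.1f}")
    print(f"tokens per summary:     {per('tokens', 'summaries'):.1f}")
    print(f"sentences per summary:  {per('sentences', 'summaries'):.1f}")
```

This gives roughly 22.6 summaries per chart, in line with the "around 20" collected per chart, and about 54 tokens per summary.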

Annotation guidelines

The repository also provides the annotation guidelines for aligning summary and chart data.

NLG

We used the data to train several natural language generation models, namely a baseline LSTM encoder-decoder with attention, Chart2Text (Obeid and Hoque 2020) and KGPT (Chen et al. 2021).

Project structure

.
├── data                                    # Charts and summaries
│   ├── chart_summaries.xml                 # Annotated summaries arranged by topic and chart
│   ├── charts                              # Chart images and summaries by topic
│   │   ├── 01
│   │   └── ...
│   └── chartID2plotinto.json               # Plotting information for each chart
├── splits_nlg                              # Data splits for NLG experiments
│   ├── c2t                                 # Data for the Chart2Text model
│   ├── kgpt                                # Data for the KGPT model
│   ├── lstm                                # Data for the LSTM model
│   └── splits_combinations_ids.json        # Summary IDs by data splits
├── Annotation_Guidelines_2.0.pdf           # Annotation guidelines for labeling summaries
└── README.md
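
The internal schema of the XML and JSON files is not documented here, so a reasonable first step is to inspect them before writing a loader. The following is a minimal sketch that assumes it is run from the repository root and makes no assumptions about element names or JSON keys. The helper names `summarize_json` and `count_xml_tags` are hypothetical and not part of the repository.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def summarize_json(path):
    """Return the top-level JSON type and, for an object, its sorted keys."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    keys = sorted(data) if isinstance(data, dict) else []
    return type(data).__name__, keys


def count_xml_tags(path):
    """Count how often each element tag occurs in an XML file."""
    root = ET.parse(path).getroot()
    return Counter(element.tag for element in root.iter())


if __name__ == "__main__":
    repo = Path(".")  # assumption: current directory is the repository root
    json_files = [
        repo / "data" / "chartID2plotinto.json",
        repo / "splits_nlg" / "splits_combinations_ids.json",
    ]
    for json_path in json_files:
        if json_path.exists():
            kind, keys = summarize_json(json_path)
            print(json_path, kind, keys[:10])  # show at most ten keys

    xml_path = repo / "data" / "chart_summaries.xml"
    if xml_path.exists():
        for tag, n in count_xml_tags(xml_path).most_common():
            print(tag, n)
```

The tag counts and top-level keys printed by this script show how topics, charts, and summaries are nested, which is what a real loader would need to know.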

Citing

The dataset is described in a paper submitted for review to the LREC 2022 conference.
