Resolving References in Visually-Grounded Dialogue via Text Generation

🚧 NOTE: We are in the process of adding the material described in our paper to this repo. Our annotations for "A Game Of Sorts" are already available.

Repository for the paper "Resolving References in Visually-Grounded Dialogue via Text Generation" presented at SIGDIAL 2023. Please cite the following work if you use anything from this repository or from our paper:

@inproceedings{willemsen-etal-2023-resolving,
    title = "Resolving References in Visually-Grounded Dialogue via Text Generation",
    author = "Willemsen, Bram  and
      Qian, Livia  and
      Skantze, Gabriel",
    booktitle = "Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sigdial-1.43",
    pages = "457--469"
}

📜 Overview


🔭 The Task

In this paper, we treat visually-grounded reference resolution as a text-image retrieval task, where referents are represented by images. We frame the discourse processing side of the task as a causal language modeling problem. Pretrained vision-language models (VLMs) have learned to match relatively short, high-level descriptions with their associated images and have been shown to be effective at zero-shot text-image retrieval based on such descriptions, but they have not learned to process longer, conversational inputs. By fine-tuning a large language model (LLM) for referent description generation, we can augment the discourse processing capabilities of these VLMs. Referent description generation can be regarded as a special case of referring expression generation with the goal of always generating the most complete expression possible. For a given mention, the model is trained to generate a definite description that summarizes all information that has been explicitly disclosed about the referent during a conversation. We will refer to the fine-tuned model as the conversational referent description generator (CRDG). The description generated by the CRDG is then used by a pretrained VLM to identify the referent, zero-shot.

Figure 1: The proposed visually-grounded reference resolution framework. With the CRDG we generate a referent description for a marked mention, to be used by a (frozen) pretrained VLM for referent identification.

Figure 1 shows a visualization of the proposed framework. Our approach can be seen as an exploration of the limits of depending on linguistic context alone for generating referent descriptions, as the discourse processing and eventual grounding of the descriptions are entirely disjoint. For a more formal task definition, we refer the reader to Section 3.1 of our paper.
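The two-stage framework described above can be illustrated with a minimal sketch. This is not the code from the paper: the function names, type aliases, and toy data below are all hypothetical, and the two stand-in functions only mimic the roles of the fine-tuned CRDG and the frozen VLM so that the control flow can be run without any model weights.

```python
from typing import Callable, Sequence

# Hypothetical type aliases; none of these names come from the repository.
DescriptionGenerator = Callable[[Sequence[str], str], str]
SimilarityScorer = Callable[[str, str], float]


def resolve_reference(
    dialogue: Sequence[str],
    mention: str,
    candidate_images: Sequence[str],
    generate_description: DescriptionGenerator,
    score: SimilarityScorer,
) -> str:
    """Return the candidate image that best matches the generated description."""
    # Stage 1 (role of the CRDG): text-only, the generator never sees the images.
    description = generate_description(dialogue, mention)
    # Stage 2 (role of the frozen VLM): zero-shot text-image retrieval.
    return max(candidate_images, key=lambda image: score(description, image))


# Toy stand-ins, for illustration only.
TOY_CAPTIONS = {
    "img_1.jpg": "a brown dog lying on a sofa",
    "img_2.jpg": "a white cat on a chair",
}


def toy_generator(dialogue: Sequence[str], mention: str) -> str:
    # Placeholder for the fine-tuned LLM: returns a fixed definite description.
    return "the brown dog on the sofa"


def toy_scorer(description: str, image: str) -> float:
    # Placeholder for VLM text-image similarity: word overlap (Jaccard)
    # between the description and a hand-written caption for the image.
    desc_words = set(description.lower().split())
    caption_words = set(TOY_CAPTIONS[image].lower().split())
    return len(desc_words & caption_words) / len(desc_words | caption_words)


dialogue = [
    "A: I think the dog should go first.",
    "B: The brown one on the sofa?",
    "A: Yes, that one.",
]
best = resolve_reference(
    dialogue, "that one", list(TOY_CAPTIONS), toy_generator, toy_scorer
)
print(best)
```

Because the two stages only communicate through the description string, either stand-in could in principle be swapped for a real model without changing `resolve_reference`, which mirrors the disjoint design discussed above.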


📄 The Data

A Game Of Sorts

The data that were used for the fine-tuning and evaluation of our approach came from the collaborative image ranking task "A Game Of Sorts". For information about this task, we refer the reader to the "Collecting Visually-Grounded Dialogue with A Game Of Sorts" paper.

In order to reproduce our work and make effective use of our annotations, you will need the "A Game Of Sorts" data:

git clone https://github.com/willemsenbram/a-game-of-sorts.git

To download the images, run the following from within the ./a-game-of-sorts/dataset/ directory:

bash get_images.sh

The images will be downloaded to ./a-game-of-sorts/dataset/images.

Our Annotations

Span-based mention annotations aligned with the images they denote can be found in the ./annotations/data directory.

The referent descriptions from the various sources as discussed in the paper, including the manually constructed "ground truth" labels that have been used for fine-tuning and evaluation, can be found in the ./descriptions/data directory.
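As a quick sanity check after cloning and downloading, the sketch below counts the files found under the directories named in this section. It assumes that a-game-of-sorts was cloned into the root of this repository, which is an assumption and not something this README states; the helper name `count_files` is hypothetical, and the sketch makes no assumption about the format of the files themselves.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional

# Directories named in this README. Assumption: a-game-of-sorts was cloned
# into the root of this repository; adjust the first path otherwise.
EXPECTED_DIRS = (
    "a-game-of-sorts/dataset/images",
    "annotations/data",
    "descriptions/data",
)


def count_files(
    root: Path, directories: Iterable[str] = EXPECTED_DIRS
) -> Dict[str, Optional[int]]:
    """Map each directory to its file count, or None if the directory is missing."""
    counts: Dict[str, Optional[int]] = {}
    for relative in directories:
        directory = root / relative
        if directory.is_dir():
            # Count regular files at any depth below the directory.
            counts[relative] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[relative] = None
    return counts


# Report on the current working directory.
for name, count in count_files(Path(".")).items():
    status = "missing" if count is None else f"{count} file(s)"
    print(f"{name}: {status}")
```

A directory reported as missing usually means the clone or the image download step has not been run from the expected location.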


🐍 The Code


📚 Supplementary Material

The supplementary material (supplementary_material.pdf) covers additional details about our human evaluation as well as hyperparameters used for model fine-tuning.
