
Resolving References in Visually-Grounded Dialogue via Text Generation

🚧 NOTE: We are in the process of adding the material described in our paper to this repo. Our annotations for "A Game Of Sorts" are already available.

Repository for the paper "Resolving References in Visually-Grounded Dialogue via Text Generation" presented at SIGDIAL 2023. Please cite the following work if you use anything from this repository or from our paper:

@inproceedings{willemsen-etal-2023-resolving,
    title = "Resolving References in Visually-Grounded Dialogue via Text Generation",
    author = "Willemsen, Bram  and
      Qian, Livia  and
      Skantze, Gabriel",
    booktitle = "Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sigdial-1.43",
    pages = "457--469"
}

📜 Overview


🔭 The Task

In this paper, we treat visually-grounded reference resolution as a text-image retrieval task in which referents are represented by images. We frame the discourse processing side of the task as a causal language modeling problem. VLMs that have been pretrained to match relatively short, high-level descriptions with their associated images have been shown to be effective at zero-shot text-image retrieval based on such descriptions, but they have not learned to process longer, conversational inputs. By fine-tuning an LLM for referent description generation, we can augment the discourse processing capabilities of these VLMs. Referent description generation can be regarded as a special case of referring expression generation in which the goal is to always generate the most complete expression possible: for a given mention, the model is trained to generate a definite description that summarizes all information that has been explicitly disclosed about the referent over the course of the conversation. We refer to the fine-tuned model as the conversational referent description generator (CRDG). The description generated by the CRDG is then used by a pretrained VLM to identify the referent, zero-shot.
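To make the causal language modeling framing concrete, below is a minimal fine-tuning sketch using Hugging Face transformers. The base checkpoint (gpt2), the bracket-based mention marking, the separator, and the example strings are illustrative assumptions, not the setup from the paper; see the paper and the supplementary material for the actual models, input format, and hyperparameters.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base model; the paper's choice of LLM and hyperparameters
# are documented in the paper and the supplementary material.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical training example: dialogue context with the target mention
# marked (here with brackets), a separator, and the "ground truth" definite
# description as the continuation to be generated.
context = "A: I'd pick the fluffy one. B: you mean [the white one]? =>"
target = " the white, fluffy cat"

inputs = tokenizer(context + target, return_tensors="pt")
labels = inputs["input_ids"].clone()
# Compute the loss on the description tokens only, not the dialogue context.
context_length = tokenizer(context, return_tensors="pt")["input_ids"].shape[1]
labels[:, :context_length] = -100  # -100 is ignored by the loss

loss = model(**inputs, labels=labels).loss  # standard next-token prediction loss
loss.backward()  # an optimizer step would follow in a real training loop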

Figure 1: The proposed visually-grounded reference resolution framework. With the CRDG we generate a referent description for a marked mention, to be used by a (frozen) pretrained VLM for referent identification.

Figure 1 shows a visualization of the proposed framework. Our approach can be seen as an exploration of the limits of depending on linguistic context alone for generating referent descriptions, as the discourse processing and eventual grounding of the descriptions are entirely disjoint. For a more formal task definition, we refer the reader to Section 3.1 of our paper.
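To illustrate the grounding step, the following sketch performs zero-shot referent identification with a frozen CLIP model via Hugging Face transformers. CLIP stands in here as an example of a pretrained VLM; the description string and image file names are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A frozen, pretrained VLM; CLIP is used here purely as an example.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

description = "the white, fluffy cat"  # output of the CRDG
candidates = [Image.open(p) for p in ("img_1.jpg", "img_2.jpg", "img_3.jpg")]

inputs = processor(text=[description], images=candidates,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_text  # similarity scores, shape (1, 3)
predicted = logits.argmax(dim=-1).item()  # index of the identified referent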


📄 The Data

A Game Of Sorts

The data used for fine-tuning and evaluating our approach come from the collaborative image ranking task "A Game Of Sorts". For more information about this task, we refer the reader to the paper "Collecting Visually-Grounded Dialogue with A Game Of Sorts".

To reproduce our work and make effective use of our annotations, you will need the "A Game Of Sorts" data:

git clone https://github.com/willemsenbram/a-game-of-sorts.git

To download the images, run the following from the ./a-game-of-sorts/dataset/ directory:

bash get_images.sh

The images will be downloaded to ./a-game-of-sorts/dataset/images.

Our Annotations

Span-based mention annotations aligned with the images they denote can be found in the ./annotations/data directory.

The referent descriptions from the various sources discussed in the paper, including the manually constructed "ground truth" labels used for fine-tuning and evaluation, can be found in the ./descriptions/data directory.
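As a starting point for working with these files, here is a minimal inspection sketch. It assumes JSON-formatted data, which may not match the actual files; adjust the glob pattern and parsing to the formats and schema found in the repository.

import json
from pathlib import Path

# Assumes JSON files; check the repository for the actual formats and schema.
for data_dir in (Path("./annotations/data"), Path("./descriptions/data")):
    for path in sorted(data_dir.glob("**/*.json")):
        with open(path) as f:
            data = json.load(f)
        print(path, type(data).__name__)  # inspect the top-level structure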


🍝 The Code


📚 Supplementary Material

The supplementary material (supplementary_material.pdf) covers additional details about our human evaluation as well as hyperparameters used for model fine-tuning.
