Bram Willemsen

Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding

Sep 09, 2024

Bram Willemsen, Gabriel Skantze

Abstract: We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.

* Accepted for publication at INLG 2024
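
The generate-and-rerank step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `rerank`, `toy_comprehend`, and the hard-coded scores are hypothetical stand-ins, and the sketch assumes the comprehension model returns one probability per candidate image given the dialogue context and a candidate RE.

```python
from typing import Callable, Sequence

# Hypothetical signature: (dialogue context, candidate RE) -> probability per candidate image.
Comprehend = Callable[[str, str], Sequence[float]]


def rerank(context: str, candidates: Sequence[str], target: int,
           comprehend: Comprehend) -> list[str]:
    """Order candidate REs by the probability the comprehension model
    assigns to the intended referent (index `target`), highest first."""
    return sorted(candidates,
                  key=lambda expression: comprehend(context, expression)[target],
                  reverse=True)


# Toy stand-in for a comprehension model (assumption): fixed scores over three
# images. A real model would condition on the dialogue context; this one ignores it.
TOY_SCORES = {
    "it": [0.34, 0.33, 0.33],
    "the dog": [0.40, 0.35, 0.25],
    "the brown dog on the left": [0.80, 0.10, 0.10],
}


def toy_comprehend(context: str, expression: str) -> Sequence[float]:
    return TOY_SCORES[expression]


ranked = rerank("A: which one do you mean?", list(TOY_SCORES), target=0,
                comprehend=toy_comprehend)
print(ranked[0])  # the candidate that best singles out image 0
```

In this toy setup the most discriminative candidate is ranked first, whereas a decoder that returns only its single most likely output has no such check on discriminatory power.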

Resolving References in Visually-Grounded Dialogue via Text Generation

Sep 23, 2023

Bram Willemsen, Livia Qian, Gabriel Skantze

Abstract: Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.

* Published at SIGDIAL 2023
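
The second stage, identifying a referent from a generated description, can be illustrated with a small retrieval sketch. It assumes a CLIP-style setup in which the description and each candidate image are embedded in a shared space and compared by cosine similarity; the abstract does not state the scoring function, and the `retrieve` helper and toy vectors below are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(text_emb: Sequence[float],
             image_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the image whose embedding is closest to the text."""
    scores = [cosine(text_emb, emb) for emb in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (assumption): in practice these would come from a pretrained
# VLM's text and image encoders.
description_emb = [1.0, 0.2, 0.0]
candidate_image_embs = [
    [0.0, 1.0, 0.0],
    [0.9, 0.3, 0.1],
    [0.0, 0.0, 1.0],
]

predicted = retrieve(description_emb, candidate_image_embs)
print(predicted)
```

No retrieval-specific training is involved here: the description is simply scored against every candidate image, which is what makes the step zero-shot.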


Collecting Visually-Grounded Dialogue with A Game Of Sorts

Sep 10, 2023

Bram Willemsen, Dmytro Kalpakchi, Gabriel Skantze

Abstract: An idealized, though simplistic, view of the referring expression production and grounding process in (situated) dialogue assumes that a speaker must merely appropriately specify their expression so that the target referent may be successfully identified by the addressee. However, referring in conversation is a collaborative process that cannot be aptly characterized as an exchange of minimally-specified referring expressions. Concerns have been raised regarding assumptions made by prior work on visually-grounded dialogue that reveal an oversimplified view of conversation and the referential process. We address these concerns by introducing a collaborative image ranking task, a grounded agreement game we call "A Game Of Sorts". In our game, players are tasked with reaching agreement on how to rank a set of images given some sorting criterion through a largely unrestricted, role-symmetric dialogue. By putting emphasis on the argumentation in this mixed-initiative interaction, we collect discussions that involve the collaborative referential process. We describe results of a small-scale data collection experiment with the proposed task. All discussed materials, which include the collected data, the codebase, and a containerized version of the application, are publicly available.

* Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022), pages 2257-2268, Marseille, France. European Language Resources Association


CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

Nov 15, 2021

Gabriel Skantze, Bram Willemsen

Abstract: This paper presents CoLLIE: a simple yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected into the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use. We verify the model's performance on two different tasks of continual learning and show that it can efficiently learn and generalize from only a few examples, with little interference with the model's original zero-shot performance.
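
The idea of adjusting language embeddings only "when needed" can be illustrated with a toy sketch. This is one possible reading of the abstract, not CoLLIE's actual architecture: the `adjust` function, the stored `(key, shift)` examples, and the similarity threshold are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def adjust(text_emb: Sequence[float],
           examples: Sequence[tuple[Sequence[float], Sequence[float]]],
           threshold: float = 0.8) -> list[float]:
    """Shift a text embedding toward a learned correction, but only if it is
    similar enough to a stored example; otherwise return it unchanged."""
    best_sim, best_shift = max(
        ((cosine(text_emb, key), shift) for key, shift in examples),
        key=lambda pair: pair[0],
        default=(0.0, None),
    )
    if best_shift is None or best_sim < threshold:
        return list(text_emb)  # unrelated language: leave zero-shot behaviour alone
    return [x + best_sim * d for x, d in zip(text_emb, best_shift)]


# One stored example (hypothetical): a key embedding and the shift learned for it.
examples = [([1.0, 0.0], [0.0, 0.5])]

similar = adjust([0.9, 0.1], examples)    # close to the key, so it is shifted
unrelated = adjust([0.0, 1.0], examples)  # far from the key, so it is untouched
print(similar, unrelated)
```

The gating is what the sketch is meant to convey: inputs near a learned example inherit its correction, which loosely mirrors generalization to similar language use, while everything else passes through unmodified.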
