Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ioannis Konstas

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

Oct 26, 2024

Houman Mehrafarin, Arash Eshghi, Ioannis Konstas

Abstract:Evaluating Large Language Models (LLMs) on reasoning benchmarks demonstrates their ability to solve compositional questions. However, little is known of whether these models engage in genuine logical reasoning or simply rely on implicit cues to generate answers. In this paper, we investigate the transitive reasoning capabilities of two distinct LLM architectures, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets: QASC and Bamboogle. We controlled for potential cues that might influence the models' performance, including (a) word/phrase overlaps across sections of test input; (b) models' inherent knowledge during pre-training or fine-tuning; and (c) Named Entities. Our findings reveal that while both models leverage (a), Flan-T5 shows more resilience to experiments (b and c), having less variance than LLaMA 2. This suggests that models may develop an understanding of transitivity through fine-tuning on knowingly relevant datasets, a hypothesis we leave to future work.

* To appear in EMNLP Main 2024

Via

Access Paper or Ask Questions

CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts

Oct 20, 2024

Malvina Nikandrou, Georgios Pantazopoulos, Nikolas Vitsakis, Ioannis Konstas, Alessandro Suglia

Abstract:As Vision and Language models (VLMs) become accessible across the globe, it is important that they demonstrate cultural knowledge. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts and evaluate the capacity for cultural adaptation through contextual information. This allows us to distinguish between parametric knowledge acquired during training and contextual knowledge provided during inference via visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance disparities between culture-specific and common concepts in the parametric setting. Moreover, experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that need to be addressed toward more culturally inclusive models.

Via

Access Paper or Ask Questions

Re-examining Sexism and Misogyny Classification with Annotator Attitudes

Oct 04, 2024

Aiqi Jiang, Nikolas Vitsakis, Tanvi Dinkar, Gavin Abercrombie, Ioannis Konstas

Figure 1 for Re-examining Sexism and Misogyny Classification with Annotator Attitudes

Figure 2 for Re-examining Sexism and Misogyny Classification with Annotator Attitudes

Figure 3 for Re-examining Sexism and Misogyny Classification with Annotator Attitudes

Figure 4 for Re-examining Sexism and Misogyny Classification with Annotator Attitudes

Abstract:Gender-Based Violence (GBV) is an increasing problem online, but existing datasets fail to capture the plurality of possible annotator perspectives or ensure the representation of affected groups. We revisit two important stages in the moderation pipeline for GBV: (1) manual data labelling; and (2) automated classification. For (1), we examine two datasets to investigate the relationship between annotator identities and attitudes and the responses they give to two GBV labelling tasks. To this end, we collect demographic and attitudinal information from crowd-sourced annotators using three validated surveys from Social Psychology. We find that higher Right Wing Authoritarianism scores are associated with a higher propensity to label text as sexist, while for Social Dominance Orientation and Neosexist Attitudes, higher scores are associated with a negative tendency to do so. For (2), we conduct classification experiments using Large Language Models and five prompting strategies, including infusing prompts with annotator information. We find: (i) annotator attitudes affect the ability of classifiers to predict their labels; (ii) including attitudinal information can boost performance when we use well-structured brief annotator descriptions; and (iii) models struggle to reflect the increased complexity and imbalanced classes of the new label sets.

* Accepted by EMNLP 2024

Via

Access Paper or Ask Questions

Voices in a Crowd: Searching for Clusters of Unique Perspectives

Jul 19, 2024

Nikolas Vitsakis, Amit Parekh, Ioannis Konstas

Figure 1 for Voices in a Crowd: Searching for Clusters of Unique Perspectives

Figure 2 for Voices in a Crowd: Searching for Clusters of Unique Perspectives

Figure 3 for Voices in a Crowd: Searching for Clusters of Unique Perspectives

Figure 4 for Voices in a Crowd: Searching for Clusters of Unique Perspectives

Abstract:Language models have been shown to reproduce underlying biases existing in their training data, which is the majority perspective by default. Proposed solutions aim to capture minority perspectives by either modelling annotator disagreements or grouping annotators based on shared metadata, both of which face significant challenges. We propose a framework that trains models without encoding annotator metadata, extracts latent embeddings informed by annotator behaviour, and creates clusters of similar opinions, that we refer to as voices. Resulting clusters are validated post-hoc via internal and external quantitative metrics, as well a qualitative analysis to identify the type of voice that each cluster represents. Our results demonstrate the strong generalisation capability of our framework, indicated by resulting clusters being adequately robust, while also capturing minority perspectives based on different demographic factors throughout two distinct datasets.

Via

Access Paper or Ask Questions

Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks

Jul 04, 2024

Amit Parekh, Nikolas Vitsakis, Alessandro Suglia, Ioannis Konstas

Figure 1 for Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks

Figure 2 for Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks

Figure 3 for Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks

Figure 4 for Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks

Abstract:Evaluating the generalisation capabilities of multimodal models based solely on their performance on out-of-distribution data fails to capture their true robustness. This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models, considering architectural design, input perturbations across language and vision modalities, and increased task complexity. The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes, raising concerns about overfitting to spurious correlations. By employing this evaluation framework on current Transformer-based multimodal models for robotic manipulation tasks, we uncover limitations and suggest future advancements should focus on architectural and training innovations that better integrate multimodal inputs, enhancing a model's generalisation prowess by prioritising sensitivity to input content over incidental correlations.

Via

Access Paper or Ask Questions

Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Jun 27, 2024

Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, Alessandro Suglia

Figure 1 for Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Figure 2 for Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Figure 3 for Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Figure 4 for Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Abstract:Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that each modality evolves at different rates across a continuum of tasks and that this behavior occurs in established encoder-only models as well as modern recipes for developing Vision & Language (VL) models. Motivated by this observation, we propose a modality-aware feature distillation (MAFED) approach which outperforms existing baselines across models of varying scale in three multimodal continual learning settings. Furthermore, we provide ablations showcasing that modality-aware distillation complements experience replay. Overall, our results emphasize the importance of addressing modality-specific dynamics to prevent forgetting in multimodal continual learning.

Via

Access Paper or Ask Questions

AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Jun 19, 2024

Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon

Abstract:AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI

* Code available https://github.com/alanaai/EVUD

Via

Access Paper or Ask Questions

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

Dec 05, 2023

Alessandro Suglia, Ioannis Konstas, Oliver Lemon

Abstract:In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by ``listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of `language games' to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.

* Preprint for JAIR before copyediting

Via

Access Paper or Ask Questions

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Nov 07, 2023

Georgios Pantazopoulos, Malvina Nikandrou, Amit Parekh, Bhathiya Hemanthage, Arash Eshghi, Ioannis Konstas, Verena Rieser, Oliver Lemon, Alessandro Suglia

Figure 1 for Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Figure 2 for Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Figure 3 for Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Figure 4 for Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Abstract:Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action prediction as multimodal text generation. By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks. Different to previous modular approaches with independently trained components, we use a single multitask model where each task contributes to goal completion. EMMA performs on par with similar models on several VL benchmarks and sets a new state-of-the-art performance (36.81% success rate) on the Dialog-guided Task Completion (DTC), a benchmark to evaluate dialog-guided agents in the Alexa Arena

* EMNLP 2023

Via

Access Paper or Ask Questions

No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Jul 31, 2023

Vevake Balaraman, Arash Eshghi, Ioannis Konstas, Ioannis Papaioannou

Figure 1 for No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Figure 2 for No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Figure 3 for No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Figure 4 for No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Abstract:The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent after the addressee's erroneous response. Here, we collect and publicly release Repair-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data is comprised of the TPR turns, corresponding dialogue contexts, and candidate repairs of the original turn for execution of TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model, as well as OpenAI's GPT-3 LLMs. Additionally, we extrinsically evaluate the LLMs' TPR processing capabilities in the downstream conversational QA task. The results indicate poor out-of-the-box performance on TPR's by the GPT-3 models, which then significantly improves when exposed to Repair-QA.

* Accepted at SIGDIAL'23

Via

Access Paper or Ask Questions