Abstract:When operating in human environments, robots need to handle complex tasks while both adhering to social norms and accommodating individual preferences. For instance, based on common sense knowledge, a household robot can predict that it should avoid vacuuming during a social gathering, but it may still be uncertain whether it should vacuum before or after having guests. In such cases, integrating common-sense knowledge with human preferences, often conveyed through human explanations, is fundamental yet a challenge for existing systems. In this paper, we introduce GRACE, a novel approach addressing this while generating socially appropriate robot actions. GRACE leverages common sense knowledge from Large Language Models (LLMs), and it integrates this knowledge with human explanations through a generative network architecture. The bidirectional structure of GRACE enables robots to refine and enhance LLM predictions by utilizing human explanations and makes robots capable of generating such explanations for human-specified actions. Our experimental evaluations show that integrating human explanations boosts GRACE's performance, where it outperforms several baselines and provides sensible explanations.
Abstract:Social agents and robots are increasingly being used in wellbeing settings. However, a key challenge is that these agents and robots typically rely on machine learning (ML) algorithms to detect and analyse an individual's mental wellbeing. The problem of bias and fairness in ML algorithms is becoming an increasingly greater source of concern. In concurrence, existing literature has also indicated that mental health conditions can manifest differently across genders and cultures. We hypothesise that the representation of features (acoustic, textual, and visual) and their inter-modal relations would vary among subjects from different cultures and genders, thus impacting the performance and fairness of various ML models. We present the very first evaluation of multimodal gender fairness in depression manifestation by undertaking a study on two different datasets from the USA and China. We undertake thorough statistical and ML experimentation and repeat the experiments for several different algorithms to ensure that the results are not algorithm-dependent. Our findings indicate that though there are differences between both datasets, it is not conclusive whether this is due to the difference in depression manifestation as hypothesised or other external factors such as differences in data collection methodology. Our findings further motivate a call for a more consistent and culturally aware data collection process in order to address the problem of ML bias in depression detection and to promote the development of fairer agents and robots for wellbeing.
Abstract:Despite the recent advancements in robotics and machine learning (ML), the deployment of autonomous robots in our everyday lives is still an open challenge. This is due to multiple reasons among which are their frequent mistakes, such as interrupting people or having delayed responses, as well as their limited ability to understand human speech, i.e., failure in tasks like transcribing speech to text. These mistakes may disrupt interactions and negatively influence human perception of these robots. To address this problem, robots need to have the ability to detect human-robot interaction (HRI) failures. The ERR@HRI 2024 challenge tackles this by offering a benchmark multimodal dataset of robot failures during human-robot interactions (HRI), encouraging researchers to develop and benchmark multimodal machine learning models to detect these failures. We created a dataset featuring multimodal non-verbal interaction data, including facial, speech, and pose features from video clips of interactions with a robotic coach, annotated with labels indicating the presence or absence of robot mistakes, user awkwardness, and interaction ruptures, allowing for the training and evaluation of predictive models. Challenge participants have been invited to submit their multimodal ML models for detection of robot errors and to be evaluated against various performance metrics such as accuracy, precision, recall, F1 score, with and without a margin of error reflecting the time-sensitivity of these metrics. The results of this challenge will help the research field in better understanding the robot failures in human-robot interactions and designing autonomous robots that can mitigate their own errors after successfully detecting them.
Abstract:Existing Graph Neural Networks (GNNs) are limited to process graphs each of whose vertices is represented by a vector or a single value, limited their representing capability to describe complex objects. In this paper, we propose the first GNN (called Graph in Graph Neural (GIG) Network) which can process graph-style data (called GIG sample) whose vertices are further represented by graphs. Given a set of graphs or a data sample whose components can be represented by a set of graphs (called multi-graph data sample), our GIG network starts with a GIG sample generation (GSG) module which encodes the input as a \textbf{GIG sample}, where each GIG vertex includes a graph. Then, a set of GIG hidden layers are stacked, with each consisting of: (1) a GIG vertex-level updating (GVU) module that individually updates the graph in every GIG vertex based on its internal information; and (2) a global-level GIG sample updating (GGU) module that updates graphs in all GIG vertices based on their relationships, making the updated GIG vertices become global context-aware. This way, both internal cues within the graph contained in each GIG vertex and the relationships among GIG vertices could be utilized for down-stream tasks. Experimental results demonstrate that our GIG network generalizes well for not only various generic graph analysis tasks but also real-world multi-graph data analysis (e.g., human skeleton video-based action recognition), which achieved the new state-of-the-art results on 13 out of 14 evaluated datasets. Our code is publicly available at https://github.com/wangjs96/Graph-in-Graph-Neural-Network.
Abstract:Recent studies show bias in many machine learning models for depression detection, but bias in LLMs for this task remains unexplored. This work presents the first attempt to investigate the degree of gender bias present in existing LLMs (ChatGPT, LLaMA 2, and Bard) using both quantitative and qualitative approaches. From our quantitative evaluation, we found that ChatGPT performs the best across various performance metrics and LLaMA 2 outperforms other LLMs in terms of group fairness metrics. As qualitative fairness evaluation remains an open research question we propose several strategies (e.g., word count, thematic analysis) to investigate whether and how a qualitative evaluation can provide valuable insights for bias analysis beyond what is possible with quantitative evaluation. We found that ChatGPT consistently provides a more comprehensive, well-reasoned explanation for its prediction compared to LLaMA 2. We have also identified several themes adopted by LLMs to qualitatively evaluate gender fairness. We hope our results can be used as a stepping stone towards future attempts at improving qualitative evaluation of fairness for LLMs especially for high-stakes tasks such as depression detection.
Abstract:Socially Assistive Robots are studied in different Child-Robot Interaction settings. However, logistical constraints limit accessibility, particularly affecting timely support for mental wellbeing. In this work, we have investigated whether online interactions with a robot can be used for the assessment of mental wellbeing in children. The children (N=40, 20 girls and 20 boys; 8-13 years) interacted with the Nao robot (30-45 mins) over three sessions, at least a week apart. Audio-visual recordings were collected throughout the sessions that concluded with the children answering user perception questionnaires pertaining to their anxiety towards the robot, and the robot's abilities. We divided the participants into three wellbeing clusters (low, med and high tertiles) using their responses to the Short Moods and Feelings Questionnaire (SMFQ) and further analysed how their wellbeing and their perceptions of the robot changed over the wellbeing tertiles, across sessions and across participants' gender. Our primary findings suggest that (I) online mediated-interactions with robots can be effective in assessing children's mental wellbeing over time, and (II) children's overall perception of the robot either improved or remained consistent across time. Supplementary exploratory analyses have also revealed that gender affected the children's wellbeing assessments as well as their perceptions of the robot.
Abstract:As social robots become increasingly integrated into daily life, ensuring their behaviours align with social norms is crucial. For their widespread open-world application, it is important to explore Federated Learning (FL) settings where individual robots can learn about their unique environments while also learning from each others' experiences. In this paper, we present a novel FL benchmark that evaluates different strategies, using multi-label regression objectives, where each client individually learns to predict the social appropriateness of different robot actions while also sharing their learning with others. Furthermore, splitting the training data by different contexts such that each client incrementally learns across contexts, we present a novel Federated Continual Learning (FCL) benchmark that adapts FL-based methods to use state-of-the-art Continual Learning (CL) methods to continually learn socially appropriate agent behaviours under different contextual settings. Federated Averaging (FedAvg) of weights emerges as a robust FL strategy while rehearsal-based FCL enables incrementally learning the social appropriateness of robot actions, across contextual splits.
Abstract:The interdisciplinary nature of Child-Robot Interaction (CRI) fosters incorporating measures and methodologies from many established domains. However, when employing CRI approaches to sensitive avenues of health and wellbeing, caution is critical in adapting metrics to retain their safety standards and ensure accurate utilisation. In this work, we conducted a secondary analysis to previous empirical work, investigating the reliability and construct validity of established psychological questionnaires such as the Short Moods and Feelings Questionnaire (SMFQ) and three subscales (generalised anxiety, panic and low mood) of the Revised Child Anxiety and Depression Scale (RCADS) within a CRI setting for the assessment of mental wellbeing. Through confirmatory principal component analysis, we have observed that these measures are reliable and valid in the context of CRI. Furthermore, our analysis revealed that scales communicated by a robot demonstrated a better fit than when self-reported, underscoring the efficiency and effectiveness of robot-mediated psychological assessments in these settings. Nevertheless, we have also observed variations in item contributions to the main factor, suggesting potential areas of examination and revision (e.g., relating to physiological changes, inactivity and cognitive demands) when used in CRI. Findings from this work highlight the importance of verifying the reliability and validity of standardised metrics and assessment tools when employed in CRI settings, thus, aiming to avoid any misinterpretations and misrepresentations.
Abstract:Robotic coaches have been recently investigated to promote mental well-being in various contexts such as workplaces and homes. With the widespread use of Large Language Models (LLMs), HRI researchers are called to consider language appropriateness when using such generated language for robotic mental well-being coaches in the real world. Therefore, this paper presents the first work that investigated the language appropriateness of robot mental well-being coach in the workplace. To this end, we conducted an empirical study that involved 17 employees who interacted over 4 weeks with a robotic mental well-being coach equipped with LLM-based capabilities. After the study, we individually interviewed them and we conducted a focus group of 1.5 hours with 11 of them. The focus group consisted of: i) an ice-breaking activity, ii) evaluation of robotic coach language appropriateness in various scenarios, and iii) listing shoulds and shouldn'ts for designing appropriate robotic coach language for mental well-being. From our qualitative evaluation, we found that a language-appropriate robotic coach should (1) ask deep questions which explore feelings of the coachees, rather than superficial questions, (2) express and show emotional and empathic understanding of the context, and (3) not make any assumptions without clarifying with follow-up questions to avoid bias and stereotyping. These results can inform the design of language-appropriate robotic coach to promote mental well-being in real-world contexts.
Abstract:In dyadic interactions, humans communicate their intentions and state of mind using verbal and non-verbal cues, where multiple different facial reactions might be appropriate in response to a specific speaker behaviour. Then, how to develop a machine learning (ML) model that can automatically generate multiple appropriate, diverse, realistic and synchronised human facial reactions from an previously unseen speaker behaviour is a challenging task. Following the successful organisation of the first REACT challenge (REACT 2023), this edition of the challenge (REACT 2024) employs a subset used by the previous challenge, which contains segmented 30-secs dyadic interaction clips originally recorded as part of the NOXI and RECOLA datasets, encouraging participants to develop and benchmark Machine Learning (ML) models that can generate multiple appropriate facial reactions (including facial image sequences and their attributes) given an input conversational partner's stimulus under various dyadic video conference scenarios. This paper presents: (i) the guidelines of the REACT 2024 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2024.