Abstract:Robots, as AI with physical instantiation, inhabit our social and physical world, where their actions have both social and physical consequences, posing challenges for researchers when designing social robots. This study starts with a scoping review to identify discussions and potential concerns arising from interactions with robotic systems. Two focus groups of technology ethics experts then validated a comprehensive list of key topics and values in human-robot interaction (HRI) literature. These insights were integrated into the HRI Value Compass web tool, to help HRI researchers identify ethical values in robot design. The tool was evaluated in a pilot study. This work benefits the HRI community by highlighting key concerns in human-robot interactions and providing an instrument to help researchers design robots that align with human values, ensuring future robotic systems adhere to these values in social applications.
Abstract:Recent advancements in robots powered by large language models have enhanced their conversational abilities, enabling interactions closely resembling human dialogue. However, these models introduce safety and security concerns in HRI, as they are vulnerable to manipulation that can bypass built-in safety measures. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, such as by prompting the robot to act like a life partner. We conducted a pilot study involving 21 university students who interacted with a Misty robot, attempting to circumvent its safety mechanisms across three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results reveal that participants employed five techniques, including insulting and appealing to pity using emotional language. We hope this work can inform future research in designing strong safeguards to ensure ethical and secure human-robot interactions.
Abstract:Group interactions are a natural part of our daily life, and as robots become more integrated into society, they must be able to socially interact with multiple people at the same time. However, group human-robot interaction (HRI) poses unique computational challenges often overlooked in the current HRI literature. We conducted a scoping review including 44 group HRI papers from the last decade (2015-2024). From these papers, we extracted variables related to perception and behaviour generation challenges, as well as factors related to the environment, group, and robot capabilities that influence these challenges. Our findings show that key computational challenges in perception included detection of groups, engagement, and conversation information, while challenges in behaviour generation involved developing approaching and conversational behaviours. We also identified research gaps, such as improving detection of subgroups and interpersonal relationships, and recommended future work in group HRI to help researchers address these computational challenges
Abstract:Social agents and robots are increasingly being used in wellbeing settings. However, a key challenge is that these agents and robots typically rely on machine learning (ML) algorithms to detect and analyse an individual's mental wellbeing. The problem of bias and fairness in ML algorithms is becoming an increasingly greater source of concern. In concurrence, existing literature has also indicated that mental health conditions can manifest differently across genders and cultures. We hypothesise that the representation of features (acoustic, textual, and visual) and their inter-modal relations would vary among subjects from different cultures and genders, thus impacting the performance and fairness of various ML models. We present the very first evaluation of multimodal gender fairness in depression manifestation by undertaking a study on two different datasets from the USA and China. We undertake thorough statistical and ML experimentation and repeat the experiments for several different algorithms to ensure that the results are not algorithm-dependent. Our findings indicate that though there are differences between both datasets, it is not conclusive whether this is due to the difference in depression manifestation as hypothesised or other external factors such as differences in data collection methodology. Our findings further motivate a call for a more consistent and culturally aware data collection process in order to address the problem of ML bias in depression detection and to promote the development of fairer agents and robots for wellbeing.
Abstract:Despite the recent advancements in robotics and machine learning (ML), the deployment of autonomous robots in our everyday lives is still an open challenge. This is due to multiple reasons among which are their frequent mistakes, such as interrupting people or having delayed responses, as well as their limited ability to understand human speech, i.e., failure in tasks like transcribing speech to text. These mistakes may disrupt interactions and negatively influence human perception of these robots. To address this problem, robots need to have the ability to detect human-robot interaction (HRI) failures. The ERR@HRI 2024 challenge tackles this by offering a benchmark multimodal dataset of robot failures during human-robot interactions (HRI), encouraging researchers to develop and benchmark multimodal machine learning models to detect these failures. We created a dataset featuring multimodal non-verbal interaction data, including facial, speech, and pose features from video clips of interactions with a robotic coach, annotated with labels indicating the presence or absence of robot mistakes, user awkwardness, and interaction ruptures, allowing for the training and evaluation of predictive models. Challenge participants have been invited to submit their multimodal ML models for detection of robot errors and to be evaluated against various performance metrics such as accuracy, precision, recall, F1 score, with and without a margin of error reflecting the time-sensitivity of these metrics. The results of this challenge will help the research field in better understanding the robot failures in human-robot interactions and designing autonomous robots that can mitigate their own errors after successfully detecting them.
Abstract:Recent research in affective robots has recognized their potential in supporting human well-being. Due to rapidly developing affective and artificial intelligence technologies, this field of research has undergone explosive expansion and advancement in recent years. In order to develop a deeper understanding of recent advancements, we present a systematic review of the past 10 years of research in affective robotics for wellbeing. In this review, we identify the domains of well-being that have been studied, the methods used to investigate affective robots for well-being, and how these have evolved over time. We also examine the evolution of the multifaceted research topic from three lenses: technical, design, and ethical. Finally, we discuss future opportunities for research based on the gaps we have identified in our review -- proposing pathways to take affective robotics from the past and present to the future. The results of our review are of interest to human-robot interaction and affective computing researchers, as well as clinicians and well-being professionals who may wish to examine and incorporate affective robotics in their practices.
Abstract:Recent studies show bias in many machine learning models for depression detection, but bias in LLMs for this task remains unexplored. This work presents the first attempt to investigate the degree of gender bias present in existing LLMs (ChatGPT, LLaMA 2, and Bard) using both quantitative and qualitative approaches. From our quantitative evaluation, we found that ChatGPT performs the best across various performance metrics and LLaMA 2 outperforms other LLMs in terms of group fairness metrics. As qualitative fairness evaluation remains an open research question we propose several strategies (e.g., word count, thematic analysis) to investigate whether and how a qualitative evaluation can provide valuable insights for bias analysis beyond what is possible with quantitative evaluation. We found that ChatGPT consistently provides a more comprehensive, well-reasoned explanation for its prediction compared to LLaMA 2. We have also identified several themes adopted by LLMs to qualitatively evaluate gender fairness. We hope our results can be used as a stepping stone towards future attempts at improving qualitative evaluation of fairness for LLMs especially for high-stakes tasks such as depression detection.
Abstract:Robotic coaches have been recently investigated to promote mental well-being in various contexts such as workplaces and homes. With the widespread use of Large Language Models (LLMs), HRI researchers are called to consider language appropriateness when using such generated language for robotic mental well-being coaches in the real world. Therefore, this paper presents the first work that investigated the language appropriateness of robot mental well-being coach in the workplace. To this end, we conducted an empirical study that involved 17 employees who interacted over 4 weeks with a robotic mental well-being coach equipped with LLM-based capabilities. After the study, we individually interviewed them and we conducted a focus group of 1.5 hours with 11 of them. The focus group consisted of: i) an ice-breaking activity, ii) evaluation of robotic coach language appropriateness in various scenarios, and iii) listing shoulds and shouldn'ts for designing appropriate robotic coach language for mental well-being. From our qualitative evaluation, we found that a language-appropriate robotic coach should (1) ask deep questions which explore feelings of the coachees, rather than superficial questions, (2) express and show emotional and empathic understanding of the context, and (3) not make any assumptions without clarifying with follow-up questions to avoid bias and stereotyping. These results can inform the design of language-appropriate robotic coach to promote mental well-being in real-world contexts.
Abstract:In dyadic interactions, humans communicate their intentions and state of mind using verbal and non-verbal cues, where multiple different facial reactions might be appropriate in response to a specific speaker behaviour. Then, how to develop a machine learning (ML) model that can automatically generate multiple appropriate, diverse, realistic and synchronised human facial reactions from an previously unseen speaker behaviour is a challenging task. Following the successful organisation of the first REACT challenge (REACT 2023), this edition of the challenge (REACT 2024) employs a subset used by the previous challenge, which contains segmented 30-secs dyadic interaction clips originally recorded as part of the NOXI and RECOLA datasets, encouraging participants to develop and benchmark Machine Learning (ML) models that can generate multiple appropriate facial reactions (including facial image sequences and their attributes) given an input conversational partner's stimulus under various dyadic video conference scenarios. This paper presents: (i) the guidelines of the REACT 2024 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2024.
Abstract:Robotic well-being coaches have been shown to successfully promote people's mental well-being. To provide successful coaching, a robotic coach should have the capability to repair the mistakes it makes. Past investigations of robot mistakes are limited to game or task-based, one-off and in-lab studies. This paper presents a 4-phase design process to design repair strategies for robotic longitudinal well-being coaching with the involvement of real-world stakeholders: 1) designing repair strategies with a professional well-being coach; 2) a longitudinal study with the involvement of experienced users (i.e., who had already interacted with a robotic coach) to investigate the repair strategies defined in (1); 3) a design workshop with users from the study in (2) to gather their perspectives on the robotic coach's repair strategies; 4) discussing the results obtained in (2) and (3) with the mental well-being professional to reflect on how to design repair strategies for robotic coaching. Our results show that users have different expectations for a robotic coach than a human coach, which influences how repair strategies should be designed. We show that different repair strategies (e.g., apologizing, explaining, or repairing empathically) are appropriate in different scenarios, and that preferences for repair strategies change during longitudinal interactions with the robotic coach.