Abstract:Maps of dynamics are effective representations of motion patterns learned from prior observations, with recent research demonstrating their ability to enhance performance in various downstream tasks such as human-aware robot navigation, long-term human motion prediction, and robot localization. Current advancements have primarily concentrated on methods for learning maps of human flow in environments where the flow is static, i.e., not assumed to change over time. In this paper we propose a method to update the CLiFF-map, one type of map of dynamics, for achieving efficient life-long robot operation. As new observations are collected, our goal is to update a CLiFF-map to effectively and accurately integrate new observations, while retaining relevant historic motion patterns. The proposed online update method maintains a probabilistic representation in each observed location, updating parameters by continuously tracking sufficient statistics. In experiments using both synthetic and real-world datasets, we show that our method is able to maintain accurate representations of human motion dynamics, contributing to high performance flow-compliant planning downstream tasks, while being orders of magnitude faster than the comparable baselines.
Abstract:Integrating multimodal foundation models has significantly enhanced autonomous agents' language comprehension, perception, and planning capabilities. However, while existing works adopt a \emph{task-centric} approach with minimal human interaction, applying these models to developing assistive \emph{user-centric} robots that can interact and cooperate with humans remains underexplored. This paper introduces ``Bident'', a framework designed to integrate robots seamlessly into shared spaces with humans. Bident enhances the interactive experience by incorporating multimodal inputs like speech and user gaze dynamics. Furthermore, Bident supports verbal utterances and physical actions like gestures, making it versatile for bidirectional human-robot interactions. Potential applications include personalized education, where robots can adapt to individual learning styles and paces, and healthcare, where robots can offer personalized support, companionship, and everyday assistance in the home and workplace environments.
Abstract:The human gaze is an important cue to signal intention, attention, distraction, and the regions of interest in the immediate surroundings. Gaze tracking can transform how robots perceive, understand, and react to people, enabling new modes of robot control, interaction, and collaboration. In this paper, we use gaze tracking data from a rich dataset of human motion (TH\"OR-MAGNI) to investigate the coordination between gaze direction and head rotation of humans engaged in various indoor activities involving navigation, interaction with objects, and collaboration with a mobile robot. In particular, we study the spread and central bias of fixations in diverse activities and examine the correlation between gaze direction and head rotation. We introduce various human motion metrics to enhance the understanding of gaze behavior in dynamic interactions. Finally, we apply semantic object labeling to decompose the gaze distribution into activity-relevant regions.
Abstract:Social robots, owing to their embodied physical presence in human spaces and the ability to directly interact with the users and their environment, have a great potential to support children in various activities in education, healthcare and daily life. Child-Robot Interaction (CRI), as any domain involving children, inevitably faces the major challenge of designing generalized strategies to work with unique, turbulent and very diverse individuals. Addressing this challenging endeavor requires to combine the standpoint of the robot-centered perspective, i.e. what robots technically can and are best positioned to do, with that of the child-centered perspective, i.e. what children may gain from the robot and how the robot should act to best support them in reaching the goals of the interaction. This article aims to help researchers bridge the two perspectives and proposes to address the development of CRI scenarios with insights from child psychology and child development theories. To that end, we review the outcomes of the CRI studies, outline common trends and challenges, and identify two key factors from child psychology that impact child-robot interactions, especially in a long-term perspective: developmental stage and individual characteristics. For both of them we discuss prospective experiment designs which support building naturally engaging and sustainable interactions.
Abstract:Long-term human motion prediction (LHMP) is essential for safely operating autonomous robots and vehicles in populated environments. It is fundamental for various applications, including motion planning, tracking, human-robot interaction and safety monitoring. However, accurate prediction of human trajectories is challenging due to complex factors, including, for example, social norms and environmental conditions. The influence of such factors can be captured through Maps of Dynamics (MoDs), which encode spatial motion patterns learned from (possibly scattered and partial) past observations of motion in the environment and which can be used for data-efficient, interpretable motion prediction (MoD-LHMP). To address the limitations of prior work, especially regarding accuracy and sensitivity to anomalies in long-term prediction, we propose the Laminar Component Enhanced LHMP approach (LaCE-LHMP). Our approach is inspired by data-driven airflow modelling, which estimates laminar and turbulent flow components and uses predominantly the laminar components to make flow predictions. Based on the hypothesis that human trajectory patterns also manifest laminar flow (that represents predictable motion) and turbulent flow components (that reflect more unpredictable and arbitrary motion), LaCE-LHMP extracts the laminar patterns in human dynamics and uses them for human motion prediction. We demonstrate the superior prediction performance of LaCE-LHMP through benchmark comparisons with state-of-the-art LHMP methods, offering an unconventional perspective and a more intuitive understanding of human movement patterns.
Abstract:We present a new large dataset of indoor human and robot navigation and interaction, called TH\"OR-MAGNI, that is designed to facilitate research on social navigation: e.g., modelling and predicting human motion, analyzing goal-oriented interactions between humans and robots, and investigating visual attention in a social interaction context. TH\"OR-MAGNI was created to fill a gap in available datasets for human motion analysis and HRI. This gap is characterized by a lack of comprehensive inclusion of exogenous factors and essential target agent cues, which hinders the development of robust models capable of capturing the relationship between contextual cues and human behavior in different scenarios. Unlike existing datasets, TH\"OR-MAGNI includes a broader set of contextual features and offers multiple scenario variations to facilitate factor isolation. The dataset includes many social human-human and human-robot interaction scenarios, rich context annotations, and multi-modal data, such as walking trajectories, gaze tracking data, and lidar and camera streams recorded from a mobile robot. We also provide a set of tools for visualization and processing of the recorded data. TH\"OR-MAGNI is, to the best of our knowledge, unique in the amount and diversity of sensor data collected in a contextualized and socially dynamic environment, capturing natural human-robot interactions.
Abstract:Human motion prediction is important for mobile service robots and intelligent vehicles to operate safely and smoothly around people. The more accurate predictions are, particularly over extended periods of time, the better a system can, e.g., assess collision risks and plan ahead. In this paper, we propose to exploit maps of dynamics (MoDs, a class of general representations of place-dependent spatial motion patterns, learned from prior observations) for long-term human motion prediction (LHMP). We present a new MoD-informed human motion prediction approach, named CLiFF-LHMP, which is data efficient, explainable, and insensitive to errors from an upstream tracking system. Our approach uses CLiFF-map, a specific MoD trained with human motion data recorded in the same environment. We bias a constant velocity prediction with samples from the CLiFF-map to generate multi-modal trajectory predictions. In two public datasets we show that this algorithm outperforms the state of the art for predictions over very extended periods of time, achieving 45% more accurate prediction performance at 50s compared to the baseline.
Abstract:Robots are increasingly used in shared environments with humans, making effective communication a necessity for successful human-robot interaction. In our work, we study a crucial component: active communication of robot intent. Here, we present an anthropomorphic solution where a humanoid robot communicates the intent of its host robot acting as an "Anthropomorphic Robotic Mock Driver" (ARMoD). We evaluate this approach in two experiments in which participants work alongside a mobile robot on various tasks, while the ARMoD communicates a need for human attention, when required, or gives instructions to collaborate on a joint task. The experiments feature two interaction styles of the ARMoD: a verbal-only mode using only speech and a multimodal mode, additionally including robotic gaze and pointing gestures to support communication and register intent in space. Our results show that the multimodal interaction style, including head movements and eye gaze as well as pointing gestures, leads to more natural fixation behavior. Participants naturally identified and fixated longer on the areas relevant for intent communication, and reacted faster to instructions in collaborative tasks. Our research further indicates that the ARMoD intent communication improves engagement and social interaction with mobile robots in workplace settings.
Abstract:Human motion prediction is essential for the safe and smooth operation of mobile service robots and intelligent vehicles around people. Commonly used neural network-based approaches often require large amounts of complete trajectories to represent motion dynamics in complex semantically-rich spaces. This requirement may complicate deployment of physical systems in new environments, especially when the data is being collected online from onboard sensors. In this paper we explore a data-efficient alternative using maps of dynamics (MoD) to represent place-dependent multi-modal spatial motion patterns, learned from prior observations. Our approach can perform efficient human motion prediction in the long-term perspective of up to 60 seconds. We quantitatively evaluate its accuracy with limited amount of training data in comparison to an LSTM-based baseline, and qualitatively show that the predicted trajectories reflect the natural semantic properties of the environment, e.g. the locations of short- and long-term goals, navigation in narrow passages, around obstacles, etc.
Abstract:Robots are increasingly deployed in spaces shared with humans, including home settings and industrial environments. In these environments, the interaction between humans and robots (HRI) is crucial for safety, legibility, and efficiency. A key factor in HRI is trust, which modulates the acceptance of the system. Anthropomorphism has been shown to modulate trust development in a robot, but robots in industrial environments are not usually anthropomorphic. We designed a simple interaction in an industrial environment in which an anthropomorphic mock driver (ARMoD) robot simulates to drive an autonomous guided vehicle (AGV). The task consisted of a human crossing paths with the AGV, with or without the ARMoD mounted on the top, in a narrow corridor. The human and the system needed to negotiate trajectories when crossing paths, meaning that the human had to attend to the trajectory of the robot to avoid a collision with it. There was a significant increment in the reported trust scores in the condition where the ARMoD was present, showing that the presence of an anthropomorphic robot is enough to modulate the trust, even in limited interactions as the one we present here.