Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Michel Valstar

Human strategies for correcting `human-robot' errors during a laundry sorting task

Apr 11, 2025

Pepita Barnard, Maria J Galvez Trigo, Dominic Price, Sue Cobb, Gisela Reyes-Cruz, Gustavo Berumen, David Branson III, Mojtaba A. Khanesar, Mercedes Torres Torres, Michel Valstar

Abstract:Mental models and expectations underlying human-human interaction (HHI) inform human-robot interaction (HRI) with domestic robots. To ease collaborative home tasks by improving domestic robot speech and behaviours for human-robot communication, we designed a study to understand how people communicated when failure occurs. To identify patterns of natural communication, particularly in response to robotic failures, participants instructed Laundrobot to move laundry into baskets using natural language and gestures. Laundrobot either worked error-free, or in one of two error modes. Participants were not advised Laundrobot would be a human actor, nor given information about error modes. Video analysis from 42 participants found speech patterns, included laughter, verbal expressions, and filler words, such as ``oh'' and ``ok'', also, sequences of body movements, including touching one's own face, increased pointing with a static finger, and expressions of surprise. Common strategies deployed when errors occurred, included correcting and teaching, taking responsibility, and displays of frustration. The strength of reaction to errors diminished with exposure, possibly indicating acceptance or resignation. Some used strategies similar to those used to communicate with other technologies, such as smart assistants. An anthropomorphic robot may not be ideally suited to this kind of task. Laundrobot's appearance, morphology, voice, capabilities, and recovery strategies may have impacted how it was perceived. Some participants indicated Laundrobot's actual skills were not aligned with expectations; this made it difficult to know what to expect and how much Laundrobot understood. Expertise, personality, and cultural differences may affect responses, however these were not assessed.

Via

Access Paper or Ask Questions

REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge

Jan 10, 2024

Siyang Song, Micol Spitale, Cheng Luo, Cristina Palmero, German Barquero, Hengde Zhu, Sergio Escalera, Michel Valstar, Tobias Baur, Fabien Ringeval(+2 more)

Figure 1 for REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge

Figure 2 for REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge

Figure 3 for REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge

Figure 4 for REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge

Abstract:In dyadic interactions, humans communicate their intentions and state of mind using verbal and non-verbal cues, where multiple different facial reactions might be appropriate in response to a specific speaker behaviour. Then, how to develop a machine learning (ML) model that can automatically generate multiple appropriate, diverse, realistic and synchronised human facial reactions from an previously unseen speaker behaviour is a challenging task. Following the successful organisation of the first REACT challenge (REACT 2023), this edition of the challenge (REACT 2024) employs a subset used by the previous challenge, which contains segmented 30-secs dyadic interaction clips originally recorded as part of the NOXI and RECOLA datasets, encouraging participants to develop and benchmark Machine Learning (ML) models that can generate multiple appropriate facial reactions (including facial image sequences and their attributes) given an input conversational partner's stimulus under various dyadic video conference scenarios. This paper presents: (i) the guidelines of the REACT 2024 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2024.

Via

Access Paper or Ask Questions

REACT2023: the first Multi-modal Multiple Appropriate Facial Reaction Generation Challenge

Jun 11, 2023

Siyang Song, Micol Spitale, Cheng Luo, German Barquero, Cristina Palmero, Sergio Escalera, Michel Valstar, Tobias Baur, Fabien Ringeval, Elisabeth Andre(+1 more)

Figure 1 for REACT2023: the first Multi-modal Multiple Appropriate Facial Reaction Generation Challenge

Figure 2 for REACT2023: the first Multi-modal Multiple Appropriate Facial Reaction Generation Challenge

Figure 3 for REACT2023: the first Multi-modal Multiple Appropriate Facial Reaction Generation Challenge

Figure 4 for REACT2023: the first Multi-modal Multiple Appropriate Facial Reaction Generation Challenge

Abstract:The Multi-modal Multiple Appropriate Facial Reaction Generation Challenge (REACT2023) is the first competition event focused on evaluating multimedia processing and machine learning techniques for generating human-appropriate facial reactions in various dyadic interaction scenarios, with all participants competing strictly under the same conditions. The goal of the challenge is to provide the first benchmark test set for multi-modal information processing and to foster collaboration among the audio, visual, and audio-visual affective computing communities, to compare the relative merits of the approaches to automatic appropriate facial reaction generation under different spontaneous dyadic interaction conditions. This paper presents: (i) novelties, contributions and guidelines of the REACT2023 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at \url{https://github.com/reactmultimodalchallenge/baseline_react2023}.

Via

Access Paper or Ask Questions

Are 3D Face Shapes Expressive Enough for Recognising Continuous Emotions and Action Unit Intensities?

Jul 03, 2022

Mani Kumar Tellamekala, Ömer Sümer, Björn W. Schuller, Elisabeth André, Timo Giesbrecht, Michel Valstar

Figure 1 for Are 3D Face Shapes Expressive Enough for Recognising Continuous Emotions and Action Unit Intensities?

Figure 2 for Are 3D Face Shapes Expressive Enough for Recognising Continuous Emotions and Action Unit Intensities?

Figure 3 for Are 3D Face Shapes Expressive Enough for Recognising Continuous Emotions and Action Unit Intensities?

Figure 4 for Are 3D Face Shapes Expressive Enough for Recognising Continuous Emotions and Action Unit Intensities?

Abstract:Recognising continuous emotions and action unit (AU) intensities from face videos requires a spatial and temporal understanding of expression dynamics. Existing works primarily rely on 2D face appearances to extract such dynamics. This work focuses on a promising alternative based on parametric 3D face shape alignment models, which disentangle different factors of variation, including expression-induced shape variations. We aim to understand how expressive 3D face shapes are in estimating valence-arousal and AU intensities compared to the state-of-the-art 2D appearance-based models. We benchmark four recent 3D face alignment models: ExpNet, 3DDFA-V2, DECA, and EMOCA. In valence-arousal estimation, expression features of 3D face models consistently surpassed previous works and yielded an average concordance correlation of .739 and .574 on SEWA and AVEC 2019 CES corpora, respectively. We also study how 3D face shapes performed on AU intensity estimation on BP4D and DISFA datasets, and report that 3D face features were on par with 2D appearance features in AUs 4, 6, 10, 12, and 25, but not the entire set of AUs. To understand this discrepancy, we conduct a correspondence analysis between valence-arousal and AUs, which points out that accurate prediction of valence-arousal may require the knowledge of only a few AUs.

Via

Access Paper or Ask Questions

COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition

Jun 12, 2022

Mani Kumar Tellamekala, Shahin Amiriparian, Björn W. Schuller, Elisabeth André, Timo Giesbrecht, Michel Valstar

Figure 1 for COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition

Figure 2 for COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition

Figure 3 for COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition

Figure 4 for COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition

Abstract:Automatically recognising apparent emotions from face and voice is hard, in part because of various sources of uncertainty, including in the input data and the labels used in a machine learning framework. This paper introduces an uncertainty-aware audiovisual fusion approach that quantifies modality-wise uncertainty towards emotion prediction. To this end, we propose a novel fusion framework in which we first learn latent distributions over audiovisual temporal context vectors separately, and then constrain the variance vectors of unimodal latent distributions so that they represent the amount of information each modality provides w.r.t. emotion recognition. In particular, we impose Calibration and Ordinal Ranking constraints on the variance vectors of audiovisual latent distributions. When well-calibrated, modality-wise uncertainty scores indicate how much their corresponding predictions may differ from the ground truth labels. Well-ranked uncertainty scores allow the ordinal ranking of different frames across the modalities. To jointly impose both these constraints, we propose a softmax distributional matching loss. In both classification and regression settings, we compare our uncertainty-aware fusion model with standard model-agnostic fusion baselines. Our evaluation on two emotion recognition corpora, AVEC 2019 CES and IEMOCAP, shows that audiovisual emotion recognition can considerably benefit from well-calibrated and well-ranked latent uncertainty measures.

Via

Access Paper or Ask Questions

Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition

Mar 29, 2022

Vincent Karas, Mani Kumar Tellamekala, Adria Mallol-Ragolta, Michel Valstar, Björn W. Schuller

Figure 1 for Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition

Figure 2 for Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition

Figure 3 for Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition

Figure 4 for Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition

Abstract:In this paper, we present our submission to 3rd Affective Behavior Analysis in-the-wild (ABAW) challenge. Learningcomplex interactions among multimodal sequences is critical to recognise dimensional affect from in-the-wild audiovisual data. Recurrence and attention are the two widely used sequence modelling mechanisms in the literature. To clearly understand the performance differences between recurrent and attention models in audiovisual affect recognition, we present a comprehensive evaluation of fusion models based on LSTM-RNNs, self-attention and cross-modal attention, trained for valence and arousal estimation. Particularly, we study the impact of some key design choices: the modelling complexity of CNN backbones that provide features to the the temporal models, with and without end-to-end learning. We trained the audiovisual affect recognition models on in-the-wild ABAW corpus by systematically tuning the hyper-parameters involved in the network architecture design and training optimisation. Our extensive evaluation of the audiovisual fusion models shows that LSTM-RNNs can outperform the attention models when coupled with low-complex CNN backbones and trained in an end-to-end fashion, implying that attention models may not necessarily be the optimal choice for continuous-time multimodal emotion recognition.

* 10 pages, 1 figures, added references and an overview figure

Via

Access Paper or Ask Questions

Two-stage Temporal Modelling Framework for Video-based Depression Recognition using Graph Representation

Nov 30, 2021

Jiaqi Xu, Siyang Song, Keerthy Kusumam, Hatice Gunes, Michel Valstar

Figure 1 for Two-stage Temporal Modelling Framework for Video-based Depression Recognition using Graph Representation

Figure 2 for Two-stage Temporal Modelling Framework for Video-based Depression Recognition using Graph Representation

Figure 3 for Two-stage Temporal Modelling Framework for Video-based Depression Recognition using Graph Representation

Figure 4 for Two-stage Temporal Modelling Framework for Video-based Depression Recognition using Graph Representation

Abstract:Video-based automatic depression analysis provides a fast, objective and repeatable self-assessment solution, which has been widely developed in recent years. While depression clues may be reflected by human facial behaviours of various temporal scales, most existing approaches either focused on modelling depression from short-term or video-level facial behaviours. In this sense, we propose a two-stage framework that models depression severity from multi-scale short-term and video-level facial behaviours. The short-term depressive behaviour modelling stage first deep learns depression-related facial behavioural features from multiple short temporal scales, where a Depression Feature Enhancement (DFE) module is proposed to enhance the depression-related clues for all temporal scales and remove non-depression noises. Then, the video-level depressive behaviour modelling stage proposes two novel graph encoding strategies, i.e., Sequential Graph Representation (SEG) and Spectral Graph Representation (SPG), to re-encode all short-term features of the target video into a video-level graph representation, summarizing depression-related multi-scale video-level temporal information. As a result, the produced graph representations predict depression severity using both short-term and long-term facial beahviour patterns. The experimental results on AVEC 2013 and AVEC 2014 datasets show that the proposed DFE module constantly enhanced the depression severity estimation performance for various CNN models while the SPG is superior than other video-level modelling methods. More importantly, the result achieved for the proposed two-stage framework shows its promising and solid performance compared to widely-used one-stage modelling approaches.

Via

Access Paper or Ask Questions

Learning Graph Representation of Person-specific Cognitive Processes from Audio-visual Behaviours for Automatic Personality Recognition

Oct 27, 2021

Siyang Song, Zilong Shao, Shashank Jaiswal, Linlin Shen, Michel Valstar, Hatice Gunes

Figure 1 for Learning Graph Representation of Person-specific Cognitive Processes from Audio-visual Behaviours for Automatic Personality Recognition

Figure 2 for Learning Graph Representation of Person-specific Cognitive Processes from Audio-visual Behaviours for Automatic Personality Recognition

Figure 3 for Learning Graph Representation of Person-specific Cognitive Processes from Audio-visual Behaviours for Automatic Personality Recognition

Figure 4 for Learning Graph Representation of Person-specific Cognitive Processes from Audio-visual Behaviours for Automatic Personality Recognition

Abstract:This approach builds on two following findings in cognitive science: (i) human cognition partially determines expressed behaviour and is directly linked to true personality traits; and (ii) in dyadic interactions individuals' nonverbal behaviours are influenced by their conversational partner behaviours. In this context, we hypothesise that during a dyadic interaction, a target subject's facial reactions are driven by two main factors, i.e. their internal (person-specific) cognitive process, and the externalised nonverbal behaviours of their conversational partner. Consequently, we propose to represent the target subjects (defined as the listener) person-specific cognition in the form of a person-specific CNN architecture that has unique architectural parameters and depth, which takes audio-visual non-verbal cues displayed by the conversational partner (defined as the speaker) as input, and is able to reproduce the target subject's facial reactions. Each person-specific CNN is explored by the Neural Architecture Search (NAS) and a novel adaptive loss function, which is then represented as a graph representation for recognising the target subject's true personality. Experimental results not only show that the produced graph representations are well associated with target subjects' personality traits in both human-human and human-machine interaction scenarios, and outperform the existing approaches with significant advantages, but also demonstrate that the proposed novel strategies such as adaptive loss, and the end-to-end vertices/edges feature learning, help the proposed approach in learning more reliable personality representations.

* Submitted to IJCV

Via

Access Paper or Ask Questions

Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition

Mar 24, 2021

Enrique Sanchez, Mani Kumar Tellamekala, Michel Valstar, Georgios Tzimiropoulos

Figure 1 for Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition

Figure 2 for Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition

Figure 3 for Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition

Figure 4 for Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition

Abstract:Temporal context is key to the recognition of expressions of emotion. Existing methods, that rely on recurrent or self-attention models to enforce temporal consistency, work on the feature level, ignoring the task-specific temporal dependencies, and fail to model context uncertainty. To alleviate these issues, we build upon the framework of Neural Processes to propose a method for apparent emotion recognition with three key novel components: (a) probabilistic contextual representation with a global latent variable model; (b) temporal context modelling using task-specific predictions in addition to features; and (c) smart temporal context selection. We validate our approach on four databases, two for Valence and Arousal estimation (SEWA and AffWild2), and two for Action Unit intensity estimation (DISFA and BP4D). Results show a consistent improvement over a series of strong baselines as well as over state-of-the-art methods.

* Accepted at CVPR 2021

Via

Access Paper or Ask Questions

A recurrent cycle consistency loss for progressive face-to-face synthesis

Apr 14, 2020

Enrique Sanchez, Michel Valstar

Figure 1 for A recurrent cycle consistency loss for progressive face-to-face synthesis

Figure 2 for A recurrent cycle consistency loss for progressive face-to-face synthesis

Figure 3 for A recurrent cycle consistency loss for progressive face-to-face synthesis

Figure 4 for A recurrent cycle consistency loss for progressive face-to-face synthesis

Abstract:This paper addresses a major flaw of the cycle consistency loss when used to preserve the input appearance in the face-to-face synthesis domain. In particular, we show that the images generated by a network trained using this loss conceal a noise that hinders their use for further tasks. To overcome this limitation, we propose a ''recurrent cycle consistency loss" which for different sequences of target attributes minimises the distance between the output images, independent of any intermediate step. We empirically validate not only that our loss enables the re-use of generated images, but that it also improves their quality. In addition, we propose the very first network that covers the task of unconstrained landmark-guided face-to-face synthesis. Contrary to previous works, our proposed approach enables the transfer of a particular set of input features to a large span of poses and expressions, whereby the target landmarks become the ground-truth points. We then evaluate the consistency of our proposed approach to synthesise faces at the target landmarks. To the best of our knowledge, we are the first to propose a loss to overcome the limitation of the cycle consistency loss, and the first to propose an ''in-the-wild'' landmark guided synthesis approach. Code and models for this paper can be found in https://github.com/ESanchezLozano/GANnotation

* Accepted to FG 2020 (Oral). arXiv admin note: substantial text overlap with arXiv:1811.03492

Via

Access Paper or Ask Questions