Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andreas Bulling

QualitEye: Public and Privacy-preserving Gaze Data Quality Verification

Jun 06, 2025

Mayar Elfares, Pascal Reisert, Ralf Küsters, Andreas Bulling

Abstract:Gaze-based applications are increasingly advancing with the availability of large datasets but ensuring data quality presents a substantial challenge when collecting data at scale. It further requires different parties to collaborate, therefore, privacy concerns arise. We propose QualitEye--the first method for verifying image-based gaze data quality. QualitEye employs a new semantic representation of eye images that contains the information required for verification while excluding irrelevant information for better domain adaptation. QualitEye covers a public setting where parties can freely exchange data and a privacy-preserving setting where parties cannot reveal their raw data nor derive gaze features/labels of others with adapted private set intersection protocols. We evaluate QualitEye on the MPIIFaceGaze and GazeCapture datasets and achieve a high verification performance (with a small overhead in runtime for privacy-preserving versions). Hence, QualitEye paves the way for new gaze analysis methods at the intersection of machine learning, human-computer interaction, and cryptography.

Via

Access Paper or Ask Questions

HOIGaze: Gaze Estimation During Hand-Object Interactions in Extended Reality Exploiting Eye-Hand-Head Coordination

Apr 28, 2025

Zhiming Hu, Daniel Haeufle, Syn Schmitt, Andreas Bulling

Abstract:We present HOIGaze - a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: The eye, hand, and head movements are closely coordinated during HOIs and this coordination can be exploited to identify samples that are most useful for gaze estimator training - as such, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted using a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upgrades training samples belonging to the coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria digital twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement of 15.6% on HOT3D and 6.0% on ADT in mean angular error. To demonstrate the potential of our method, we further report significant performance improvements for the sample downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.

* Accepted at SIGGRAPH 2025, link: https://zhiminghu.net/hu25_hoigaze.html

Via

Access Paper or Ask Questions

Interactive Expressive Motion Generation Using Dynamic Movement Primitives

Apr 09, 2025

Till Hielscher, Andreas Bulling, Kai O. Arras

Abstract:Our goal is to enable social robots to interact autonomously with humans in a realistic, engaging, and expressive manner. The 12 Principles of Animation [1] are a well-established framework animators use to create movements that make characters appear convincing, dynamic, and emotionally expressive. This paper proposes a novel approach that leverages Dynamic Movement Primitives (DMPs) to implement key animation principles, providing a learnable, explainable, modulable, online adaptable and composable model for automatic expressive motion generation. DMPs, originally developed for general imitation learning in robotics and grounded in a spring-damper system design, offer mathematical properties that make them particularly suitable for this task. Specifically, they enable modulation of the intensities of individual principles and facilitate the decomposition of complex, expressive motion sequences into learnable and parametrizable primitives. We present the mathematical formulation of the parameterized animation principles and demonstrate the effectiveness of our framework through experiments and application on three robotic platforms with different kinematic configurations, in simulation, on actual robots and in a user study. Our results show that the approach allows for creating diverse and nuanced expressions using a single base model.

Via

Access Paper or Ask Questions

CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

Mar 09, 2025

Lei Shi, Andreas Bulling

Abstract:We propose CLAD -- a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn the latent representation of actions and observations as constraints and integrate them into the diffusion process. This approach exploits that the latent space of diffusion models already has semantics that can be used. We use the latent constraints to steer the diffusion model to better generate actions. We report extensive experiments on the popular CrossTask, Coin, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.

Via

Access Paper or Ask Questions

V$^2$Dial: Unification of Video and Visual Dialog via Multimodal Experts

Mar 03, 2025

Adnen Abdessaied, Anna Rohrbach, Marcus Rohrbach, Andreas Bulling

Abstract:We present V$^2$Dial - a novel expert-based model specifically geared towards simultaneously handling image and video input data for multimodal conversational tasks. Current multimodal models primarily focus on simpler tasks (e.g., VQA, VideoQA, video-text retrieval) and often neglect the more challenging conversational counterparts, such as video and visual/image dialog. Moreover, works on both conversational tasks evolved separately from each other despite their apparent similarities limiting their applicability potential. To this end, we propose to unify both tasks using a single model that for the first time jointly learns the spatial and temporal features of images and videos by routing them through dedicated experts and aligns them using matching and contrastive learning techniques. Furthermore, we systemically study the domain shift between the two tasks by investigating whether and to what extent these seemingly related tasks can mutually benefit from their respective training data. Extensive evaluations on the widely used video and visual dialog datasets of AVSD and VisDial show that our model achieves new state-of-the-art results across four benchmarks both in zero-shot and fine-tuning settings.

* CVPR 2025

Via

Access Paper or Ask Questions

HAIFAI: Human-AI Collaboration for Mental Face Reconstruction

Dec 09, 2024

Florian Strohm, Mihai Bâce, Andreas Bulling

Abstract:We present HAIFAI - a novel collaborative human-AI system to tackle the challenging task of reconstructing a visual representation of a face that exists only in a person's mind. Users iteratively rank images presented by the AI system based on their resemblance to a mental image. These rankings, in turn, allow the system to extract relevant image features, fuse them into a unified feature vector, and use a generative model to reconstruct the mental image. We also propose an extension called HAIFAI-X that allows users to manually refine and further improve the reconstruction using an easy-to-use slider interface. To avoid the need for tedious human data collection for model training, we introduce a computational user model of human ranking behaviour. For this, we collected a small face ranking dataset through an online crowd-sourcing study containing data from 275 participants. We evaluate HAIFAI and HAIFAI-X in a 12-participant user study and show that HAIFAI outperforms the previous state of the art regarding reconstruction quality, usability, perceived workload, and reconstruction speed. HAIFAI-X achieves even better reconstruction quality at the cost of reduced usability, perceived workload, and increased reconstruction time. We further validate the reconstructions in a subsequent face ranking study with 18 participants and show that HAIFAI-X achieves a new state-of-the-art identification rate of 60.6%. These findings represent a significant advancement towards developing new collaborative intelligent systems capable of reliably and effortlessly reconstructing a user's mental image.

Via

Access Paper or Ask Questions

HaHeAE: Learning Generalisable Joint Representations of Human Hand and Head Movements in Extended Reality

Oct 21, 2024

Zhiming Hu, Guanhua Zhang, Zheming Yin, Daniel Haeufle, Syn Schmitt, Andreas Bulling

Abstract:Human hand and head movements are the most pervasive input modalities in extended reality (XR) and are significant for a wide range of applications. However, prior works on hand and head modelling in XR only explored a single modality or focused on specific applications. We present HaHeAE - a novel self-supervised method for learning generalisable joint representations of hand and head movements in XR. At the core of our method is an autoencoder (AE) that uses a graph convolutional network-based semantic encoder and a diffusion-based stochastic encoder to learn the joint semantic and stochastic representations of hand-head movements. It also features a diffusion-based decoder to reconstruct the original signals. Through extensive evaluations on three public XR datasets, we show that our method 1) significantly outperforms commonly used self-supervised methods by up to 74.0% in terms of reconstruction quality and is generalisable across users, activities, and XR environments, 2) enables new applications, including interpretable hand-head cluster identification and variable hand-head movement generation, and 3) can serve as an effective feature extractor for downstream tasks. Together, these results demonstrate the effectiveness of our method and underline the potential of self-supervised methods for jointly modelling hand-head behaviours in extended reality.

Via

Access Paper or Ask Questions

Amortized Bayesian Multilevel Models

Aug 23, 2024

Daniel Habermann, Marvin Schmitt, Lars Kühmichel, Andreas Bulling, Stefan T. Radev, Paul-Christian Bürkner

Figure 1 for Amortized Bayesian Multilevel Models

Figure 2 for Amortized Bayesian Multilevel Models

Figure 3 for Amortized Bayesian Multilevel Models

Figure 4 for Amortized Bayesian Multilevel Models

Abstract:Multilevel models (MLMs) are a central building block of the Bayesian workflow. They enable joint, interpretable modeling of data across hierarchical levels and provide a fully probabilistic quantification of uncertainty. Despite their well-recognized advantages, MLMs pose significant computational challenges, often rendering their estimation and evaluation intractable within reasonable time constraints. Recent advances in simulation-based inference offer promising solutions for addressing complex probabilistic models using deep generative networks. However, the utility and reliability of deep learning methods for estimating Bayesian MLMs remains largely unexplored, especially when compared with gold-standard samplers. To this end, we explore a family of neural network architectures that leverage the probabilistic factorization of multilevel models to facilitate efficient neural network training and subsequent near-instant posterior inference on unseen data sets. We test our method on several real-world case studies and provide comprehensive comparisons to Stan as a gold-standard method where possible. Finally, we provide an open-source implementation of our methods to stimulate further research in the nascent field of amortized Bayesian inference.

* 24 pages, 13 figures

Via

Access Paper or Ask Questions

Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions

Jul 09, 2024

Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling

Abstract:We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet, existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three different MToMnet variants: two involving fusion of latent representations and one involving re-ranking of classification scores. We evaluate our approach on two challenging real-world datasets, one focusing on belief prediction, while the other examining belief dynamics prediction. Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters. Taken together, our method opens up a highly promising direction for future work on artificial intelligent systems that can robustly predict human beliefs from their non-verbal behaviour and, as such, more effectively collaborate with humans.

* ECAI 2024

Via

Access Paper or Ask Questions

Multi-Modal Video Dialog State Tracking in the Wild

Jul 02, 2024

Adnen Abdessaied, Lei Shi, Andreas Bulling

Abstract:We present MST-MIXER - a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short of two major aspects: (1) They either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of real-world in the wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.

* ECCV 2024

Via

Access Paper or Ask Questions