Abstract:Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering consumer-grade head-mounted displays (HMDs) in an affordable way, it is likely that XR will become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)--powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. In this paper, we provide the community with an open-source, customizable, extensible, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with various LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes the latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: https://gitlab.lrz.de/hctl/cuify
Abstract:Feature engineering is crucial for optimizing machine learning model performance, particularly in tabular data classification tasks. Leveraging advancements in natural language processing, this study presents a systematic approach to enrich tabular datasets with features derived from large language model embeddings. Through a comprehensive ablation study on diverse datasets, we assess the impact of RoBERTa and GPT-2 embeddings on ensemble classifiers, including Random Forest, XGBoost, and CatBoost. Results indicate that integrating embeddings with traditional numerical and categorical features often enhances predictive performance, especially on datasets with class imbalance or limited features and samples, such as UCI Adult, Heart Disease, Titanic, and Pima Indian Diabetes, with improvements particularly notable in XGBoost and CatBoost classifiers. Additionally, feature importance analysis reveals that LLM-derived features frequently rank among the most impactful for the predictions. This study provides a structured approach to embedding-based feature enrichment and illustrates its benefits in ensemble learning for tabular data.
Abstract:We explore the transformative potential of SAM 2, a vision foundation model, in advancing gaze estimation and eye tracking technologies. By significantly reducing annotation time, lowering technical barriers through its ease of deployment, and enhancing segmentation accuracy, SAM 2 addresses critical challenges faced by researchers and practitioners. Utilizing its zero-shot segmentation capabilities with minimal user input-a single click per video-we tested SAM 2 on over 14 million eye images from diverse datasets, including virtual reality setups and the world's largest unified dataset recorded using wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches the performance of domain-specific models trained solely on eye images, achieving competitive mean Intersection over Union (mIoU) scores of up to 93% without fine-tuning. Additionally, we provide our code and segmentation masks for these widely used datasets to promote further research.
Abstract:In online education, innovative tools are crucial for enhancing learning outcomes. SAM (Study with AI Mentor) is an advanced platform that integrates educational videos with a context-aware chat interface powered by large language models. SAM encourages students to ask questions and explore unclear concepts in real-time, offering personalized, context-specific assistance, including explanations of formulas, slides, and images. In a crowdsourced user study involving 140 participants, SAM was evaluated through pre- and post-knowledge tests, comparing a group using SAM with a control group. The results demonstrated that SAM users achieved greater knowledge gains, with a 96.8% answer accuracy. Participants also provided positive feedback on SAM's usability and effectiveness. SAM's proactive approach to learning not only enhances learning outcomes but also empowers students to take full ownership of their educational experience, representing a promising future direction for online learning tools.
Abstract:The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning independently from the concrete prompting strategy, while for smaller models the in-context learning approach significantly influences the performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.
Abstract:This paper explores the innovative application of Large Language Models (LLMs) in Virtual Reality (VR) environments to promote heritage education, focusing on traditional Scottish curling presented in the game ``Scottish Bonspiel VR''. Our study compares the effectiveness of LLM-based chatbots with pre-defined scripted chatbots, evaluating key criteria such as usability, user engagement, and learning outcomes. The results show that LLM-based chatbots significantly improve interactivity and engagement, creating a more dynamic and immersive learning environment. This integration helps document and preserve cultural heritage and enhances dissemination processes, which are crucial for safeguarding intangible cultural heritage (ICH) amid environmental changes. Furthermore, the study highlights the potential of novel technologies in education to provide immersive experiences that foster a deeper appreciation of cultural heritage. These findings support the wider application of LLMs and VR in cultural education to address global challenges and promote sustainable practices to preserve and enhance cultural heritage.
Abstract:Swift and accurate detection of specified objects is crucial for many industrial applications, such as safety monitoring on construction sites. However, traditional approaches rely heavily on arduous manual annotation and data collection, which struggle to adapt to ever-changing environments and novel target objects. To address these limitations, this paper presents DART, an automated end-to-end pipeline designed to streamline the entire workflow of an object detection application from data collection to model deployment. DART eliminates the need for human labeling and extensive data collection while excelling in diverse scenarios. It employs a subject-driven image generation module (DreamBooth with SDXL) for data diversification, followed by an annotation stage where open-vocabulary object detection (Grounding DINO) generates bounding box annotations for both generated and original images. These pseudo-labels are then reviewed by a large multimodal model (GPT-4o) to guarantee credibility before serving as ground truth to train real-time object detectors (YOLO). We apply DART to a self-collected dataset of construction machines named Liebherr Product, which contains over 15K high-quality images across 23 categories. The current implementation of DART significantly increases average precision (AP) from 0.064 to 0.832. Furthermore, we adopt a modular design for DART to ensure easy exchangeability and extensibility. This allows for a smooth transition to more advanced algorithms in the future, seamless integration of new object categories without manual labeling, and adaptability to customized environments without extra data collection. The code and dataset are released at https://github.com/chen-xin-94/DART.
Abstract:In recent years, model explanation methods have been designed to interpret model decisions faithfully and intuitively so that users can easily understand them. In this paper, we propose a framework, Faithful Attention Explainer (FAE), capable of generating faithful textual explanations regarding the attended-to features. Towards this goal, we deploy an attention module that takes the visual feature maps from the classifier for sentence generation. Furthermore, our method successfully learns the association between features and words, which allows a novel attention enforcement module for attention explanation. Our model achieves promising performance in caption quality metrics and a faithful decision-relevance metric on two datasets (CUB and ACT-X). In addition, we show that FAE can interpret gaze-based human attention, as human gaze indicates the discriminative features that humans use for decision-making, demonstrating the potential of deploying human gaze for advanced human-AI interaction.
Abstract:Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important. To effectively automate the process of video analysis based on eye-tracking data, it is important to accurately replicate human gaze behavior. However, this task presents significant challenges due to the inherent complexity and ambiguity of human gaze patterns. In this work, we introduce a novel method for simulating human gaze behavior. Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer, with the primary role of watching videos and simulating human gaze behavior. We employed an eye-tracking dataset gathered from videos generated by the VirtualHome simulator, with a primary focus on activity recognition. Our experimental results demonstrate the effectiveness of our gaze prediction method by highlighting its capability to replicate human gaze behavior and its applicability for downstream tasks where real human-gaze is used as input.
Abstract:Humans utilize their gaze to concentrate on essential information while perceiving and interpreting intentions in videos. Incorporating human gaze into computational algorithms can significantly enhance model performance in video understanding tasks. In this work, we address a challenging and innovative task in video understanding: predicting the actions of an agent in a video based on a partial video. We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input. Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence to fulfill this intention. To assess the efficiency of our approach, we collect a dataset containing household activities generated in the VirtualHome environment, accompanied by human gaze data of viewing videos. Our method outperforms state-of-the-art techniques, achieving a 7\% improvement in accuracy for 18-class intention recognition. This highlights the efficiency of our method in learning important features from human gaze data.