Abstract:The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (i.e., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments in the overall score, and an internal consistency of $ICC=.80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality.
Abstract:Real-time object localization on edge devices is fundamental for numerous applications, ranging from surveillance to industrial automation. Traditional frameworks, such as object detection, segmentation, and keypoint detection, struggle in resource-constrained environments, often resulting in substantial target omissions. To address these challenges, we introduce OCDet, a lightweight Object Center Detection framework optimized for edge devices with NPUs. OCDet predicts heatmaps representing object center probabilities and extracts center points through peak identification. Unlike prior methods using fixed Gaussian distribution, we introduce Generalized Centerness (GC) to generate ground truth heatmaps from bounding box annotations, providing finer spatial details without additional manual labeling. Built on NPU-friendly Semantic FPN with MobileNetV4 backbones, OCDet models are trained by our Balanced Continuous Focal Loss (BCFL), which alleviates data imbalance and focuses training on hard negative examples for probability regression tasks. Leveraging the novel Center Alignment Score (CAS) with Hungarian matching, we demonstrate that OCDet consistently outperforms YOLO11 in object center detection, achieving up to 23% higher CAS while requiring 42% fewer parameters, 34% less computation, and 64% lower NPU latency. When compared to keypoint detection frameworks, OCDet achieves substantial CAS improvements up to 186% using identical models. By integrating GC, BCFL, and CAS, OCDet establishes a new paradigm for efficient and robust object center detection on edge devices with NPUs. The code is released at https://github.com/chen-xin-94/ocdet.
Abstract:Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering consumer-grade head-mounted displays (HMDs) in an affordable way, it is likely that XR will become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)--powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. In this paper, we provide the community with an open-source, customizable, extensible, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with various LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes the latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: https://gitlab.lrz.de/hctl/cuify
Abstract:Feature engineering is crucial for optimizing machine learning model performance, particularly in tabular data classification tasks. Leveraging advancements in natural language processing, this study presents a systematic approach to enrich tabular datasets with features derived from large language model embeddings. Through a comprehensive ablation study on diverse datasets, we assess the impact of RoBERTa and GPT-2 embeddings on ensemble classifiers, including Random Forest, XGBoost, and CatBoost. Results indicate that integrating embeddings with traditional numerical and categorical features often enhances predictive performance, especially on datasets with class imbalance or limited features and samples, such as UCI Adult, Heart Disease, Titanic, and Pima Indian Diabetes, with improvements particularly notable in XGBoost and CatBoost classifiers. Additionally, feature importance analysis reveals that LLM-derived features frequently rank among the most impactful for the predictions. This study provides a structured approach to embedding-based feature enrichment and illustrates its benefits in ensemble learning for tabular data.
Abstract:We explore the transformative potential of SAM 2, a vision foundation model, in advancing gaze estimation and eye tracking technologies. By significantly reducing annotation time, lowering technical barriers through its ease of deployment, and enhancing segmentation accuracy, SAM 2 addresses critical challenges faced by researchers and practitioners. Utilizing its zero-shot segmentation capabilities with minimal user input-a single click per video-we tested SAM 2 on over 14 million eye images from diverse datasets, including virtual reality setups and the world's largest unified dataset recorded using wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches the performance of domain-specific models trained solely on eye images, achieving competitive mean Intersection over Union (mIoU) scores of up to 93% without fine-tuning. Additionally, we provide our code and segmentation masks for these widely used datasets to promote further research.
Abstract:In online education, innovative tools are crucial for enhancing learning outcomes. SAM (Study with AI Mentor) is an advanced platform that integrates educational videos with a context-aware chat interface powered by large language models. SAM encourages students to ask questions and explore unclear concepts in real-time, offering personalized, context-specific assistance, including explanations of formulas, slides, and images. In a crowdsourced user study involving 140 participants, SAM was evaluated through pre- and post-knowledge tests, comparing a group using SAM with a control group. The results demonstrated that SAM users achieved greater knowledge gains, with a 96.8% answer accuracy. Participants also provided positive feedback on SAM's usability and effectiveness. SAM's proactive approach to learning not only enhances learning outcomes but also empowers students to take full ownership of their educational experience, representing a promising future direction for online learning tools.
Abstract:The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning independently from the concrete prompting strategy, while for smaller models the in-context learning approach significantly influences the performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.
Abstract:This paper explores the innovative application of Large Language Models (LLMs) in Virtual Reality (VR) environments to promote heritage education, focusing on traditional Scottish curling presented in the game ``Scottish Bonspiel VR''. Our study compares the effectiveness of LLM-based chatbots with pre-defined scripted chatbots, evaluating key criteria such as usability, user engagement, and learning outcomes. The results show that LLM-based chatbots significantly improve interactivity and engagement, creating a more dynamic and immersive learning environment. This integration helps document and preserve cultural heritage and enhances dissemination processes, which are crucial for safeguarding intangible cultural heritage (ICH) amid environmental changes. Furthermore, the study highlights the potential of novel technologies in education to provide immersive experiences that foster a deeper appreciation of cultural heritage. These findings support the wider application of LLMs and VR in cultural education to address global challenges and promote sustainable practices to preserve and enhance cultural heritage.
Abstract:Swift and accurate detection of specified objects is crucial for many industrial applications, such as safety monitoring on construction sites. However, traditional approaches rely heavily on arduous manual annotation and data collection, which struggle to adapt to ever-changing environments and novel target objects. To address these limitations, this paper presents DART, an automated end-to-end pipeline designed to streamline the entire workflow of an object detection application from data collection to model deployment. DART eliminates the need for human labeling and extensive data collection while excelling in diverse scenarios. It employs a subject-driven image generation module (DreamBooth with SDXL) for data diversification, followed by an annotation stage where open-vocabulary object detection (Grounding DINO) generates bounding box annotations for both generated and original images. These pseudo-labels are then reviewed by a large multimodal model (GPT-4o) to guarantee credibility before serving as ground truth to train real-time object detectors (YOLO). We apply DART to a self-collected dataset of construction machines named Liebherr Product, which contains over 15K high-quality images across 23 categories. The current implementation of DART significantly increases average precision (AP) from 0.064 to 0.832. Furthermore, we adopt a modular design for DART to ensure easy exchangeability and extensibility. This allows for a smooth transition to more advanced algorithms in the future, seamless integration of new object categories without manual labeling, and adaptability to customized environments without extra data collection. The code and dataset are released at https://github.com/chen-xin-94/DART.
Abstract:In recent years, model explanation methods have been designed to interpret model decisions faithfully and intuitively so that users can easily understand them. In this paper, we propose a framework, Faithful Attention Explainer (FAE), capable of generating faithful textual explanations regarding the attended-to features. Towards this goal, we deploy an attention module that takes the visual feature maps from the classifier for sentence generation. Furthermore, our method successfully learns the association between features and words, which allows a novel attention enforcement module for attention explanation. Our model achieves promising performance in caption quality metrics and a faithful decision-relevance metric on two datasets (CUB and ACT-X). In addition, we show that FAE can interpret gaze-based human attention, as human gaze indicates the discriminative features that humans use for decision-making, demonstrating the potential of deploying human gaze for advanced human-AI interaction.