Abstract:Recent technological advancements have enhanced our ability to collect and analyze rich multimodal data (e.g., speech, video, and eye gaze) to better inform learning and training experiences. While previous reviews have focused on parts of the multimodal pipeline (e.g., conceptual models and data fusion), a comprehensive literature review on the methods informing multimodal learning and training environments has not been conducted. This literature review provides an in-depth analysis of research methods in these environments, proposing a taxonomy and framework that encapsulates recent methodological advances in this field and characterizes the multimodal domain in terms of five modality groups: Natural Language, Video, Sensors, Human-Centered, and Environment Logs. We introduce a novel data fusion category -- mid fusion -- and a graph-based technique for refining literature reviews, termed citation graph pruning. Our analysis reveals that leveraging multiple modalities offers a more holistic understanding of the behaviors and outcomes of learners and trainees. Even when multimodality does not enhance predictive accuracy, it often uncovers patterns that contextualize and elucidate unimodal data, revealing subtleties that a single modality may miss. However, there remains a need for further research to bridge the divide between multimodal learning and training studies and foundational AI research.
Abstract:Cochlear Implant (CI) procedures involve inserting an array of electrodes into the cochlea, located in the inner ear. Mastoidectomy is a surgical procedure that uses a high-speed drill to remove part of the mastoid region of the temporal bone, providing safe access to the cochlea through the middle and inner ear. We aim to develop an intraoperative navigation system that registers plans created using 3D preoperative Computerized Tomography (CT) volumes with the 2D surgical microscope view. Herein, we propose a method to synthesize the mastoidectomy volume using only the preoperative CT scan, where the mastoid is intact. We introduce an unsupervised learning framework designed to synthesize the mastoidectomy. For model training, this method uses postoperative CT scans to avoid manual data cleaning or labeling, even when the region removed during mastoidectomy is visible but affected by metal artifacts, low signal-to-noise ratio, or electrode wiring. Our approach estimates mastoidectomy regions with a mean Dice score of 70.0%. This approach represents a major step forward for CI intraoperative navigation by predicting realistic mastoidectomy-removed regions in preoperative planning that can be used to register the pre-surgery plan to intraoperative microscopy.
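The Dice score reported in this abstract measures the overlap between a predicted region and a reference region: twice the size of their intersection divided by the sum of their sizes. The following is a minimal sketch of that metric on flattened binary masks, not the authors' evaluation code; the function name `dice_score`, the toy masks, and the handling of two empty masks are illustrative assumptions.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks (nonzero = inside region)."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels marked as inside the region in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the two region sizes.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks are treated as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example (hypothetical data): 3 predicted voxels, 3 reference voxels, 2 shared.
pred_mask = [1, 1, 1, 0, 0, 0]
ref_mask = [0, 1, 1, 1, 0, 0]
print(f"Dice = {dice_score(pred_mask, ref_mask):.3f}")
```

A 3D segmentation would be flattened to one sequence per volume before calling the function. A score of 1.0 means the two regions coincide exactly, and 0.0 means they do not overlap.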
Abstract:This study presents a novel framework for 3D gaze tracking tailored for mixed-reality settings, aimed at enhancing joint attention and collaborative efforts in team-based scenarios. Conventional gaze tracking, often limited by monocular cameras and traditional eye-tracking apparatus, struggles with simultaneous data synchronization and analysis from multiple participants in group contexts. Our proposed framework leverages state-of-the-art computer vision and machine learning techniques to overcome these obstacles, enabling precise 3D gaze estimation without dependence on specialized hardware or complex data fusion. Utilizing facial recognition and deep learning, the framework achieves real-time tracking of gaze patterns across several individuals, addressing common depth estimation errors and ensuring spatial and identity consistency within the dataset. Empirical results demonstrate the accuracy and reliability of our method in group environments. This provides mechanisms for significant advances in behavior and interaction analysis in educational and professional training applications in dynamic and unstructured environments.
Abstract:The primary focus of this paper is the development of a framework for pose and size estimation of unseen objects from a single RGB image, all in real time. In 2019, the first category-level pose and size estimation framework was proposed alongside two novel datasets called CAMERA and REAL. However, current methodologies are restricted from practical use because of their slow inference (2-4 fps). That approach's inference was slow because it used the computationally expensive Mask R-CNN framework and the Umeyama algorithm. To optimize our method and yield real-time results, our framework uses the efficient ResNet-FPN framework and decouples the translation, rotation, and size regression problems by using distinct decoders. Moreover, our methodology performs pose and size estimation in a global context, i.e., it estimates the parameters of all captured objects in the image at once. We perform extensive testing that compares precision and speed to demonstrate the capability of our method.
Abstract:Investigating children's embodied learning in mixed-reality environments, where they collaboratively simulate scientific processes, requires analyzing complex multimodal data to interpret their learning and coordination behaviors. Learning scientists have developed Interaction Analysis (IA) methodologies for analyzing such data, but this requires researchers to watch hours of videos to extract and interpret students' learning patterns. Our study aims to simplify researchers' tasks, using Machine Learning and Multimodal Learning Analytics to support the IA processes. Our study combines machine learning algorithms and multimodal analyses to support and streamline researcher efforts in developing a comprehensive understanding of students' scientific engagement through their movements, gaze, and affective responses in a simulated scenario. To facilitate an effective researcher-AI partnership, we present an initial case study to determine the feasibility of visually representing students' states, actions, gaze, affect, and movement on a timeline. Our case study focuses on a specific science scenario where students learn about photosynthesis. The timeline allows us to investigate the alignment of critical learning moments identified by multimodal and interaction analysis, and uncover insights into students' temporal learning progressions.
Abstract:For those experiencing severe-to-profound sensorineural hearing loss, the cochlear implant (CI) is the preferred treatment. Augmented reality (AR) aided surgery can potentially improve CI procedures and hearing outcomes. Typically, AR solutions for image-guided surgery rely on optical tracking systems to register pre-operative planning information to the display so that hidden anatomy or other important information can be overlaid and co-registered with the view of the surgical scene. In this paper, our goal is to develop a method that permits direct 2D-to-3D registration of the microscope video to the pre-operative Computed Tomography (CT) scan without the need for external tracking equipment. Our proposed solution involves using surface mapping of a portion of the incus in surgical recordings and determining the pose of this structure relative to the surgical microscope by performing pose estimation via the perspective-n-point (PnP) algorithm. This registration can then be applied to pre-operative segmentations of other anatomy-of-interest, as well as the planned electrode insertion trajectory, to co-register this information for the AR display. Our results show an average rotation error of less than 25 degrees and a translation error of less than 2 mm, 3 mm, and 0.55% for the x, y, and z axes, respectively. Our proposed method has the potential to generalize to other surgical procedures while requiring only a monocular microscope intraoperatively.
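PnP solvers recover a camera pose from 2D-3D point correspondences, and a candidate pose is commonly judged by its reprojection error: the pixel distance between observed image points and the model points projected through that pose. The following is a minimal sketch of that quantity under a pinhole camera model, not the authors' registration pipeline; the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the sample points are all illustrative assumptions.

```python
import math


def project(point, R, t, fx, fy, cx, cy):
    """Project a 3D model point into pixel coordinates with a pinhole camera."""
    # Rigid transform into the camera frame: X_cam = R @ X + t.
    xc, yc, zc = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if zc <= 0:
        raise ValueError("point lies behind the camera")
    # Perspective division followed by the intrinsic scaling and offset.
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def mean_reprojection_error(points_3d, points_2d, R, t, fx, fy, cx, cy):
    """Mean pixel distance between observed 2D points and projected 3D points."""
    errors = []
    for model_point, (u, v) in zip(points_3d, points_2d):
        u_hat, v_hat = project(model_point, R, t, fx, fy, cx, cy)
        errors.append(math.hypot(u_hat - u, v_hat - v))
    return sum(errors) / len(errors)


# Hypothetical setup: identity rotation, structure 100 mm in front of the camera.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_camera = [0.0, 0.0, 100.0]
intrinsics = dict(fx=1000.0, fy=1000.0, cx=320.0, cy=240.0)

model_points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
# Simulated observations: exact projections shifted by (3, 4) pixels.
observed = [
    (u + 3.0, v + 4.0)
    for u, v in (project(p, R_identity, t_camera, **intrinsics) for p in model_points)
]
print(mean_reprojection_error(model_points, observed, R_identity, t_camera, **intrinsics))
```

In practice the pose itself would come from a PnP solver applied to the tracked surface points. This sketch only evaluates how well a given pose explains the observations.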