Abstract: Multimodal large language models (MLLMs) can simultaneously process visual, textual, and auditory data, capturing insights that complement human analysis. However, existing video question-answering (VidQA) benchmarks and datasets often exhibit a bias toward a single modality, even though the questions are intended to require advanced reasoning that integrates information across modalities. In this work, we introduce the modality importance score (MIS) to identify such bias; it assesses which modality embeds the information necessary to answer a question. Additionally, we propose a method that uses state-of-the-art MLLMs to estimate modality importance, which can serve as a proxy for human judgments of modality perception. Using the MIS, we demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. We further validate the MIS through multiple ablation studies that evaluate MLLM performance on permuted feature sets. Our results indicate that current models do not integrate information effectively, owing to the modality imbalance in existing datasets. Our MLLM-derived MIS can guide the curation of modality-balanced datasets that advance multimodal learning and enhance MLLMs' ability to understand and exploit synergistic relations across modalities.
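The abstract does not specify how the MIS is computed; the sketch below illustrates one plausible ablation-based formulation, scoring each modality by how much withholding it reduces the model's probability of the correct answer. The helper `answer_prob` is a hypothetical wrapper around an MLLM, not an API from the paper.

```python
# Hypothetical sketch of an ablation-based modality importance score (MIS).
# The abstract does not give the exact formula; this is one plausible reading.
from typing import Callable, Dict

Modalities = Dict[str, object]  # e.g. {"video": ..., "audio": ..., "subtitles": ...}

def modality_importance(
    answer_prob: Callable[[Modalities, str, str], float],  # assumed MLLM wrapper
    sample: Modalities,
    question: str,
    answer: str,
) -> Dict[str, float]:
    """Score each modality by the drop in the model's probability of the
    correct answer when that modality is withheld (higher = more important)."""
    full = answer_prob(sample, question, answer)
    scores = {}
    for name in sample:
        # Withhold one modality at a time and re-query the model.
        ablated = {k: v for k, v in sample.items() if k != name}
        scores[name] = full - answer_prob(ablated, question, answer)
    return scores
```

Under such a definition, a question whose importance is concentrated in a single modality would count toward unimodal bias, while a genuinely multimodal question would spread importance across modalities.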
Abstract: Purpose: To improve upon Extreme MRI, a recently proposed method by Ong et al. for reconstructing high spatiotemporal resolution, 3D non-Cartesian acquisitions, by incorporating motion compensation into these reconstructions using an approach termed MoCo-MSLR. Methods: Motion compensation is challenging to incorporate into high spatiotemporal resolution reconstructions because of the memory footprint of the motion fields and the risk of losing dynamics by relying on an initial high temporal resolution, low spatial resolution reconstruction. Motivated by the work of Ong et al. and Huttinga et al., we estimate low spatial resolution motion fields through a loss enforced in k-space and represent these fields in a memory-efficient manner using multi-scale low-rank components. We interpolate the motion fields to the desired spatial resolution and then incorporate them into Extreme MRI. Results: MoCo-MSLR improved image quality for reconstructions at around 500 ms temporal resolution and captured bulk motion not seen in Extreme MRI. Furthermore, MoCo-MSLR resolved realistic cardiac dynamics at near 100 ms temporal resolution, whereas Extreme MRI struggled to resolve these dynamics. Conclusion: MoCo-MSLR improved image quality over Extreme MRI and resolved both respiratory and cardiac motion in 3D.
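As a rough illustration of the memory argument, the sketch below assumes the multi-scale low-rank factorization of Ong et al.: each scale stores a spatial basis and temporal coefficients, and a displacement field is expanded only when needed, then trilinearly interpolated to the reconstruction resolution. Shapes, class names, and the rescaling of displacements to voxel units are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a multi-scale low-rank (MSLR) motion-field representation.
# Each scale j stores a spatial basis L_j and temporal coefficients R_j, and
# the field at frame t is sum_j L_j @ R_j[:, t]. Shapes are illustrative.
import numpy as np
from scipy.ndimage import zoom

class MSLRMotionField:
    def __init__(self, bases, coeffs):
        # bases[j]:  (n_voxels * 3, rank_j) spatial basis for scale j
        # coeffs[j]: (rank_j, n_frames) temporal coefficients for scale j
        self.bases = bases
        self.coeffs = coeffs

    def frame(self, t, low_res_shape):
        """Expand the compressed factors into one low-res displacement field."""
        field = sum(L @ R[:, t] for L, R in zip(self.bases, self.coeffs))
        return field.reshape(3, *low_res_shape)  # (3, nx, ny, nz)

def upsample_field(field, factor):
    """Trilinearly interpolate a low-res field to reconstruction resolution;
    displacements are rescaled by the same factor to stay in voxel units."""
    return np.stack([zoom(c, factor, order=1) for c in field]) * factor
```

Storing the factors instead of dense fields reduces memory from O(voxels x frames) to roughly O((voxels + frames) x rank) per scale, which is what makes motion fields tractable alongside a high spatiotemporal resolution reconstruction.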