Title: WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Feb 06, 2025

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie

Abstract: In this paper, we introduce WorldSense, the first benchmark to assess multimodal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several distinguishing features: (i) collaboration of omni-modality: we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, along with 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators, with multiple rounds of correction to ensure quality. Based on WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope WorldSense can provide a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
