Title: Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Feb 17, 2024

Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui

Figures 1-4 for Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Abstract: Understanding the deep semantics of images is essential in an era dominated by social media. However, current research focuses primarily on the superficial description of images, revealing a notable lack of systematic investigation into their inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark for assessing the ability of Large Multimodal Models (LMMs) to understand visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Using DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and those of humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis indicates that integrating description texts during inference notably enhances LMMs' ability to perceive deep semantics. The dataset is also divided into multiple categories, within which we conduct a more detailed analysis.
