Title: DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

May 24, 2024

Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu (+2 more)

Abstract: The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data. This is primarily due to their reliance on image encoders trained to encode images into task-relevant features, which may lead them to disregard irrelevant details. Delving into the modeling capabilities of diffusion models for images naturally prompts the question: Can diffusion models serve as the eyes of large language models for image perception? In this paper, we propose DEEM, a simple and effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. This addresses the drawbacks of previous methods that solely relied on image encoders like ViT, thereby enhancing the model's resilience against out-of-distribution samples and reducing visual hallucinations. Importantly, this is achieved without requiring additional training modules and with fewer training parameters. We extensively evaluated DEEM on both our newly constructed RobustVQA benchmark and another well-known benchmark, POPE, for object hallucination. Compared to the state-of-the-art interleaved content generation models, DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data (10%), and a smaller base model size.
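The abstract reports results on POPE, an object-hallucination benchmark. Evaluations of this kind are typically framed as yes/no questions about whether an object appears in an image, and are scored with binary classification metrics. The following is a minimal sketch of such scoring, assuming model answers have already been reduced to the strings "yes" or "no". The function name `yes_no_metrics` and the input format are illustrative assumptions, not the authors' evaluation code or the official POPE tooling.

```python
from typing import Dict, Sequence


def yes_no_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Score binary yes/no answers, treating "yes" as the positive class.

    Hypothetical helper for illustration; inputs are assumed to be
    already normalized to "yes" or "no" (case and whitespace are ignored).
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    # Convert each (prediction, label) pair to booleans: True means "yes".
    pairs = [
        (p.strip().lower() == "yes", l.strip().lower() == "yes")
        for p, l in zip(predictions, labels)
    ]

    tp = sum(p and l for p, l in pairs)          # said yes, object present
    fp = sum(p and not l for p, l in pairs)      # said yes, object absent (hallucination)
    fn = sum(not p and l for p, l in pairs)      # said no, object present
    tn = sum(not p and not l for p, l in pairs)  # said no, object absent
    total = len(pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Fraction of "yes" answers; a high value can indicate a bias
        # toward affirming objects that are not in the image.
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


if __name__ == "__main__":
    preds = ["yes", "yes", "yes", "no"]
    golds = ["yes", "no", "no", "no"]
    print(yes_no_metrics(preds, golds))
```

In this toy call, two of the three "yes" answers are false positives, so precision drops while recall stays at 1.0. That is the pattern a hallucination-prone model would tend to show.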

* 25 pages
