Jaehyuk Jang

Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

May 28, 2024

Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, Changick Kim

Abstract: This study addresses the issue observed in Large Vision Language Models (LVLMs), where excessive attention on a few image tokens, referred to as blind tokens, leads to hallucinatory responses in tasks requiring fine-grained understanding of visual objects. We found that tokens receiving lower attention weights often hold essential information for identifying nuanced object details -- ranging from merely recognizing object existence to identifying their attributes (color, position, etc.) and understanding their relationships. To counteract the over-emphasis on blind tokens and to accurately respond to user queries, we introduce a technique called Attentional Vision Calibration (AVC). During the decoding phase, AVC identifies blind tokens by analyzing the image-related attention distribution. It then dynamically adjusts the logits for the next token prediction by contrasting the logits conditioned on the original visual tokens with those conditioned on the blind tokens. This effectively lowers the dependency on blind tokens and promotes a more balanced consideration of all tokens. We validate AVC on benchmarks such as POPE, MME, and AMBER, where it consistently outperforms existing decoding techniques in mitigating object hallucinations in LVLMs.

* Project page: https://sangminwoo.github.io/AvisC/
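
The two steps the abstract describes (flagging blind tokens from the image attention distribution, then contrasting the two sets of logits) might look roughly like the sketch below. This is an illustration, not the authors' implementation: the threshold rule (mean plus `lam` standard deviations), the contrast formula `(1 + alpha) * original - alpha * blind`, the function names, and all numbers are assumptions chosen for the example.

```python
import numpy as np

def select_blind_tokens(image_attention, lam=1.0):
    """Flag image tokens with unusually high attention.

    Assumed rule (not taken from the abstract): a token is 'blind'
    when its attention exceeds mean + lam * std over image tokens.
    """
    attn = np.asarray(image_attention, dtype=float)
    threshold = attn.mean() + lam * attn.std()
    return attn > threshold

def contrast_logits(logits_original, logits_blind, alpha=1.0):
    """Contrast logits from the full image with logits from blind tokens only.

    Assumed form: (1 + alpha) * original - alpha * blind.
    """
    lo = np.asarray(logits_original, dtype=float)
    lb = np.asarray(logits_blind, dtype=float)
    return (1.0 + alpha) * lo - alpha * lb

# Toy attention over five image tokens; one token dominates.
image_attention = np.array([0.02, 0.03, 0.70, 0.05, 0.20])
blind_mask = select_blind_tokens(image_attention, lam=1.0)

# Toy next-token logits over a three-word vocabulary.
logits_original = np.array([2.0, 1.0, 0.5])  # conditioned on all image tokens
logits_blind = np.array([2.5, 0.2, 0.4])     # conditioned on blind tokens only
adjusted = contrast_logits(logits_original, logits_blind, alpha=1.0)

print(blind_mask)
print(adjusted, int(np.argmax(adjusted)))
```

In this toy case the candidate favored mainly by the blind tokens loses its lead after the contrast, which is the effect the abstract attributes to AVC.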

RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

May 28, 2024

Sangmin Woo, Jaehyuk Jang, Donguk Kim, Yubin Choi, Changick Kim

Abstract: Recent advancements in Large Vision Language Models (LVLMs) have revolutionized how machines understand and generate textual responses based on visual inputs. Despite their impressive capabilities, they often produce "hallucinatory" outputs that do not accurately reflect the visual information, posing challenges in reliability and trustworthiness. Current methods such as contrastive decoding have made strides in addressing these issues by contrasting the original probability distribution of generated tokens with distorted counterparts; yet, generating visually-faithful outputs remains a challenge. In this work, we shift our focus to the opposite: What could serve as a complementary enhancement to the original probability distribution? We propose a simple, training-free method termed RITUAL to enhance robustness against hallucinations in LVLMs. Our approach employs random image transformations as complements to the original probability distribution, aiming to mitigate the likelihood of hallucinatory visual explanations by enriching the model's exposure to varied visual scenarios. Our empirical results show that while the isolated use of transformed images initially degrades performance, strategic implementation of these transformations can indeed serve as effective complements. Notably, our method is compatible with current contrastive decoding methods and does not require external models or costly self-feedback mechanisms, making it a practical addition. In experiments, RITUAL significantly outperforms existing contrastive decoding methods across several object hallucination benchmarks, including POPE, CHAIR, and MME.

* Project page: https://sangminwoo.github.io/RITUAL/
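
As a rough illustration of using a randomly transformed image as a complement rather than a contrast, the sketch below picks one transformation at random and adds the logits it produces to the original logits. The transformation set, the additive rule `original + kappa * transformed`, the names, and the numbers are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

# Hypothetical transformation pool; the abstract only says "random image transformations".
TRANSFORMS = {
    "horizontal_flip": lambda img: img[:, ::-1],
    "vertical_flip": lambda img: img[::-1, :],
    "rotate_90": lambda img: np.rot90(img),
}

def random_transform(image, rng):
    """Apply one transformation drawn uniformly from TRANSFORMS."""
    names = sorted(TRANSFORMS)
    name = names[rng.integers(len(names))]
    return name, TRANSFORMS[name](image)

def complement_logits(logits_original, logits_transformed, kappa=1.0):
    """Assumed combination rule: original + kappa * transformed."""
    lo = np.asarray(logits_original, dtype=float)
    lt = np.asarray(logits_transformed, dtype=float)
    return lo + kappa * lt

rng = np.random.default_rng(0)
image = np.arange(6).reshape(2, 3)  # stand-in for an image array
name, transformed = random_transform(image, rng)

# Toy logits; in practice both would come from the same LVLM,
# once with the original image and once with the transformed one.
logits_original = np.array([1.0, 1.2, 0.1])
logits_transformed = np.array([1.5, 0.4, 0.1])
combined = complement_logits(logits_original, logits_transformed, kappa=1.0)

print(name, transformed.shape)
print(combined)
```

Because the second view is added rather than subtracted, a token has to score well under both views to stay on top, which is one way to read "complement" in the abstract.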

Towards Robust Multimodal Prompting With Missing Modalities

Dec 27, 2023

Jaehyuk Jang, Yooseung Wang, Changick Kim

Abstract: Recently, multimodal prompting, which introduces learnable missing-aware prompts for all missing modality cases, has exhibited impressive performance. However, it encounters two critical issues: 1) The number of prompts grows exponentially as the number of modalities increases; and 2) It lacks robustness in scenarios with different missing modality settings between training and inference. In this paper, we propose a simple yet effective prompt design to address these challenges. Instead of using missing-aware prompts, we utilize prompts as modality-specific tokens, enabling them to capture the unique characteristics of each modality. Furthermore, our prompt design leverages orthogonality between prompts as a key element to learn distinct information across different modalities and promote diversity in the learned representations. Extensive experiments demonstrate that our prompt design enhances both performance and robustness while reducing the number of prompts.

* Accepted to ICASSP 2024
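
One common way to encourage orthogonality between modality-specific prompts is to penalize the cosine similarity between every pair of prompts. The sketch below shows that idea under stated assumptions: the penalty form (mean absolute off-diagonal cosine similarity), the function name, and the prompt shapes are illustrative and are not taken from the paper.

```python
import numpy as np

def orthogonality_penalty(prompts):
    """Mean absolute cosine similarity between distinct modality prompts.

    prompts: array of shape (num_modalities, prompt_length, dim).
    Returns 0.0 when all prompts are mutually orthogonal and 1.0 when
    they all point in the same (or opposite) direction.
    """
    p = np.asarray(prompts, dtype=float)
    flat = p.reshape(p.shape[0], -1)                       # one vector per modality
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    cos = unit @ unit.T                                    # pairwise cosine similarities
    off_diag = cos[~np.eye(len(cos), dtype=bool)]          # drop self-similarity
    return float(np.abs(off_diag).mean())

# Two toy prompts (e.g. one for image, one for text), each of length 1 and dim 2.
orthogonal_prompts = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
aligned_prompts = np.array([[[1.0, 0.0]], [[2.0, 0.0]]])

print(orthogonality_penalty(orthogonal_prompts))
print(orthogonality_penalty(aligned_prompts))
```

Added to a task loss during training, a term like this would push each modality's prompt toward a distinct direction, in line with the abstract's stated goal of learning distinct information per modality.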
