Abstract: In this work, we propose a training-free, trajectory-based controllable text-to-image (T2I) approach, termed TraDiffusion. This method allows users to guide image generation simply by drawing mouse trajectories. To achieve precise control, we design a distance-aware energy function that guides the latent variables so that the focus of generation stays within the areas delineated by the trajectory. The energy function comprises a control term, which draws the generation closer to the specified trajectory, and a movement term, which suppresses activity in areas far from the trajectory. Extensive experiments and qualitative assessments on the COCO dataset show that TraDiffusion enables simpler, more natural image control. Moreover, it can manipulate salient regions, attributes, and relationships in the generated images, and supports visual input based on arbitrary or enhanced trajectories.
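Below is a minimal sketch of how a distance-aware energy of this kind could score a cross-attention map against a user trajectory, assuming PyTorch and a trajectory given as pixel coordinates. The function names (distance_field, trajectory_energy) and the specific forms of the control and movement terms are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: distance-aware energy over a cross-attention map (illustrative, not the paper's exact terms).
import torch

def distance_field(traj_xy, h, w):
    """Per-pixel distance to the nearest trajectory point, normalized to [0, 1]."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()                     # (h, w, 2) pixel coords
    d = torch.cdist(grid.view(-1, 2), traj_xy.float()).min(dim=1).values
    return (d / (d.max() + 1e-8)).view(h, w)                         # 0 on the trajectory

def trajectory_energy(attn, dist, lam=1.0):
    """attn: (h, w) non-negative cross-attention for the target token.
    Control term: pull attention mass toward small trajectory distances.
    Movement term: suppress attention activity far from the trajectory."""
    p = attn / (attn.sum() + 1e-8)
    control = (p * dist).sum()                                       # expected distance of attention
    movement = (attn * (dist > 0.5).float()).mean()                  # activity in far-away regions
    return control + lam * movement

# Guidance step (sketch): backpropagate the energy into the latent variables, e.g.
# latents = latents - step_size * torch.autograd.grad(energy, latents)[0]
```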
Abstract: In this work, we propose a training-free method that injects visual referring into Multimodal Large Language Models (MLLMs) through learnable visual token optimization. We observe that the attention layers of MLLMs model the relationship between text prompt tokens and visual tokens. Our approach adjusts the visual tokens output by the MLP during inference, thereby controlling which visual tokens the text prompt tokens attend to. Specifically, we optimize a learnable visual token with an energy function that strengthens the referred regions in the attention map. This enables detailed region description and reasoning without substantial training costs or model retraining. Our method supports referring with boxes, masks, scribbles, and points, and offers a promising direction for integrating referential abilities into MLLMs. The results demonstrate that it is both controllable and interpretable.
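The following is a minimal sketch of optimizing a learnable offset on visual tokens so that text-to-visual attention concentrates on a referred region, assuming PyTorch. The attention computation, the energy form, and the names (referring_energy, optimize_visual_tokens, region_mask) are stand-ins for illustration, not a specific MLLM's implementation.

```python
# Sketch: learnable visual token optimization for region referring (illustrative assumptions).
import torch

def referring_energy(text_q, vis_k, region_mask):
    """text_q: (T, d) text-token queries; vis_k: (N, d) visual-token keys;
    region_mask: (N,) 1 inside the referred region (box/mask/scribble/point), else 0."""
    attn = torch.softmax(text_q @ vis_k.T / vis_k.shape[-1] ** 0.5, dim=-1)   # (T, N)
    in_region = (attn * region_mask).sum(dim=-1)          # attention mass on the referred region
    return -(in_region + 1e-8).log().mean()               # minimize => maximize in-region attention

def optimize_visual_tokens(text_q, vis_tokens, region_mask, steps=20, lr=1e-2):
    delta = torch.zeros_like(vis_tokens, requires_grad=True)   # learnable offset on visual tokens
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = referring_energy(text_q, vis_tokens + delta, region_mask)
        loss.backward()
        opt.step()
    return vis_tokens + delta   # adjusted visual tokens fed back to the frozen MLLM
```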
Abstract: The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have focused primarily on object hallucinations, which can be readily alleviated by introducing object detectors. However, these efforts neglect hallucinations about inter-object relationships, which are essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that probe the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrence that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The long-tail distribution of visual instruction tuning datasets significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and over-rely on the commonsense knowledge of Large Language Models, and that they struggle to reason about spatial relationships from contextual information.
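As a rough illustration of how image-level relationship-existence questions of this kind could be scored, here is a small Python sketch. The sample format, the ask_lvlm interface, and the example question are hypothetical assumptions, not R-Bench's released evaluation protocol.

```python
# Sketch: accuracy on yes/no relationship-existence questions (hypothetical interface).
def evaluate_relationship_questions(samples, ask_lvlm):
    """samples: list of dicts with 'image', 'question', and gold 'answer' ('yes' or 'no').
    ask_lvlm: callable taking (image, question) and returning the model's text answer."""
    correct = 0
    for s in samples:
        pred = ask_lvlm(s["image"], s["question"]).strip().lower()
        pred = "yes" if pred.startswith("yes") else "no"   # normalize free-form output
        correct += int(pred == s["answer"])
    return correct / max(len(samples), 1)

# Example image-level question: "Is the person riding the horse in the image?" -> gold answer "yes"/"no"
```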