Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jinwoo Ahn

Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Nov 23, 2024

Jinwoo Ahn, Hyeokjoon Kwon, Hwiyeon Yoo

Figure 1 for Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Figure 2 for Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Figure 3 for Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Figure 4 for Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Abstract:Recent advent of vision-based foundation models has enabled efficient and high-quality object detection at ease. Despite the success of previous studies, object detection models face limitations on capturing small components from holistic objects and taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and allow users to directly guide the detection process via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention yet grants them significant control. With FOCUS, users can make explainable requests to actively guide the detection process in the intended direction. Our results show that FOCUS effectively enhances the detection capabilities of baseline models and shows consistent performance across varying object types.

Via

Access Paper or Ask Questions

Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Nov 21, 2024

Heejeong Nam, Jinwoo Ahn

Figure 1 for Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Figure 2 for Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Figure 3 for Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Figure 4 for Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Abstract:The ability to perform complex reasoning across multimodal inputs is essential for models to effectively interact with humans in real-world scenarios. Advancements in vision-language models have significantly improved performance on tasks that require processing explicit and direct textual inputs, such as Visual Question Answering (VQA) and Visual Grounding (VG). However, less attention has been given to improving the model capabilities to comprehend nuanced and ambiguous forms of communication. This presents a critical challenge, as human language in real-world interactions often convey hidden intentions that rely on context for accurate interpretation. To address this gap, we propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes. Additionally, we contribute a model-based pipeline for generating prompt-solution pairs from input images. Our work aims to delve deeper into the ability of models to understand indirect communication and seek to contribute to the development of models capable of more refined and human-like interactions. Extensive evaluation on multiple VLMs reveals that mainstream models still struggle with indirect communication when required to perform complex linguistic and visual reasoning. We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.

Via

Access Paper or Ask Questions

Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Jun 10, 2024

Jinwoo Ahn, Junhyeok Park, Min-Jun Kim, Kang-Hyeon Kim, So-Yeong Sohn, Yun-Ji Lee, Du-Seong Chang, Yu-Jung Heo, Eun-Sol Kim

Figure 1 for Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Figure 2 for Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Figure 3 for Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Figure 4 for Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Abstract:In this paper, the solution of HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge is presented. Beyond conventional visual question-answering problems, the SMART-101 challenge aims to achieve human-level multimodal understanding by tackling complex visio-linguistic puzzles designed for children in the 6-8 age group. To solve this problem, we suggest two main ideas. First, to utilize the reasoning ability of a large-scale language model (LLM), the given visual cues (images) are grounded in the text modality. For this purpose, we generate highly detailed text captions that describe the context of the image and use these captions as input for the LLM. Second, due to the nature of puzzle images, which often contain various geometric visual patterns, we utilize an object detection algorithm to ensure these patterns are not overlooked in the captioning process. We employed the SAM algorithm, which can detect various-size objects, to capture the visual features of these geometric patterns and used this information as input for the LLM. Under the puzzle split configuration, we achieved an option selection accuracy Oacc of 29.5 on the test set and a weighted option selection accuracy (WOSA) of 27.1 on the challenge set.

Via

Access Paper or Ask Questions

Chain-of-Feedback: Mitigating the Effects of Inconsistency in Responses

Feb 05, 2024

Jinwoo Ahn

Abstract:Large Language Models (LLMs) frequently suffer from knowledge-intensive questions, often being inconsistent by providing different outputs despite given the same input. The response quality worsens when the user expresses a firm opposing stance which causes the LLMs to adjust its response despite the correct initial one. These behaviors decrease the reliability and validity of the responses provided by these models. In this paper, we attempt to 1) raise awareness of the inherent risks that follow from overly relying on AI agents like ChatGPT by showing how Chain-of-Feedback (CoF) triggers LLMs to deviate more from the actual answer and 2) suggest a novel prompting method, Recursive Chain of Feedback (R-CoF), that we are conducting further study. The CoF system takes in an open-ended multi-step question. Then, we repetitively provide meaningless feedback requesting another attempt. Our preliminary experiments show that such feedback only decreases the quality of the response. On the other hand, to mitigate the effects of the aforementioned inconsistencies, we present a novel method of recursively revising the initial incorrect reasoning provided by the LLM by repetitively breaking down each incorrect step into smaller individual problems.

* Still Ongoing Work

Via

Access Paper or Ask Questions

Goal Driven Discovery of Distributional Differences via Language Descriptions

Feb 28, 2023

Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, Jacob Steinhardt

Figure 1 for Goal Driven Discovery of Distributional Differences via Language Descriptions

Figure 2 for Goal Driven Discovery of Distributional Differences via Language Descriptions

Figure 3 for Goal Driven Discovery of Distributional Differences via Language Descriptions

Figure 4 for Goal Driven Discovery of Distributional Differences via Language Descriptions

Abstract:Mining large corpora can generate useful discoveries but is time-consuming for humans. We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way. The task input is a problem comprising a research goal "$\textit{comparing the side effects of drug A and drug B}$" and a corpus pair (two large collections of patients' self-reported reactions after taking each drug). The output is a language description (discovery) of how these corpora differ (patients taking drug A "$\textit{mention feelings of paranoia}$" more often). We build a D5 system, and to quantitatively measure its performance, we 1) contribute a meta-dataset, OpenD5, aggregating 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health, and 2) propose a set of unified evaluation metrics: validity, relevance, novelty, and significance. With the dataset and the unified metrics, we confirm that language models can use the goals to propose more relevant, novel, and significant candidate discoveries. Finally, our system produces discoveries previously unknown to the authors on a wide range of applications in OpenD5, including temporal and demographic differences in discussion topics, political stances and stereotypes in speech, insights in commercial reviews, and error patterns in NLP models.

Via

Access Paper or Ask Questions