Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Min Song

Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

May 29, 2025

Jihoon Lee, Min Song

Abstract:Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.

* ACL 2025 Findings camera-ready version. Code is released at https://github.com/JiHoonLee9898/RVCD

Via

Access Paper or Ask Questions

Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation

May 27, 2025

Seungmin Lee, Yongsang Yoo, Minhwa Jung, Min Song

Abstract:Dialogue Topic Segmentation (DTS) aims to divide dialogues into coherent segments. DTS plays a crucial role in various NLP downstream tasks, but suffers from chronic problems: data shortage, labeling ambiguity, and incremental complexity of recently proposed solutions. On the other hand, Despite advances in Large Language Models (LLMs) and reasoning strategies, these have rarely been applied to DTS. This paper introduces Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation, which utilizes LLM-based multi-step deductive reasoning to enhance DTS performance and enable case study using intermediate result. Our method employs a structured prompting approach for bidirectional context summarization, utterance intent classification, and deductive topic shift detection. In the intent classification process, we propose the generalizable intent list for domain-agnostic dialogue intent classification. Experiments in various dialogue settings demonstrate that Def-DTS consistently outperforms traditional and state-of-the-art approaches, with each subtask contributing to improved performance, particularly in reducing type 2 error. We also explore the potential for autolabeling, emphasizing the importance of LLM reasoning techniques in DTS.

* 19 pages, 3 figures, Accepted to Findings of the ACL 2025

Via

Access Paper or Ask Questions

sudo rm -rf agentic_security

Mar 26, 2025

Sejin Lee, Jian Kim, Haon Park, Ashkan Yousefpour, Sangyoon Yu, Min Song

Figure 1 for sudo rm -rf agentic_security

Figure 2 for sudo rm -rf agentic_security

Figure 3 for sudo rm -rf agentic_security

Figure 4 for sudo rm -rf agentic_security

Abstract:Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal trained safeguards in commercial computer-use agents, such as Claude Computer Use. The core mechanism, Detox2Tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on a built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24% (with no refinement), and up to 41% (by its iterative refinement) in Claude Computer Use. By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs.

Via

Access Paper or Ask Questions

TIPO: Text to Image with Text Presampling for Prompt Optimization

Nov 12, 2024

Shih-Ying Yeh, Sang-Hyun Park, Giyeong Oh, Min Song, Youngjae Yu

Figure 1 for TIPO: Text to Image with Text Presampling for Prompt Optimization

Figure 2 for TIPO: Text to Image with Text Presampling for Prompt Optimization

Figure 3 for TIPO: Text to Image with Text Presampling for Prompt Optimization

Figure 4 for TIPO: Text to Image with Text Presampling for Prompt Optimization

Abstract:TIPO (Text to Image with text pre-sampling for Prompt Optimization) is an innovative framework designed to enhance text-to-image (T2I) generation by language model (LM) for automatic prompt engineering. By refining and extending user-provided prompts, TIPO bridges the gap between simple inputs and the detailed prompts required for high-quality image generation. Unlike previous approaches that rely on Large Language Models (LLMs) or reinforcement learning (RL), TIPO adjusts user input prompts with the distribution of a trained prompt dataset, eliminating the need for complex runtime cost via lightweight model. This pre-sampling approach enables efficient and scalable prompt optimization, grounded in the model's training distribution. Experimental results demonstrate TIPO's effectiveness in improving aesthetic scores, reducing image corruption, and better aligning generated images with dataset distributions. These findings highlight the critical role of prompt engineering in T2I systems and open avenues for broader applications of automatic prompt refinement.

* 21 pages, 13 figures

Via

Access Paper or Ask Questions

Beyond Ontology in Dialogue State Tracking for Goal-Oriented Chatbot

Oct 30, 2024

Sejin Lee, Dongha Kim, Min Song

Abstract:Goal-oriented chatbots are essential for automating user tasks, such as booking flights or making restaurant reservations. A key component of these systems is Dialogue State Tracking (DST), which interprets user intent and maintains the dialogue state. However, existing DST methods often rely on fixed ontologies and manually compiled slot values, limiting their adaptability to open-domain dialogues. We propose a novel approach that leverages instruction tuning and advanced prompt strategies to enhance DST performance, without relying on any predefined ontologies. Our method enables Large Language Model (LLM) to infer dialogue states through carefully designed prompts and includes an anti-hallucination mechanism to ensure accurate tracking in diverse conversation contexts. Additionally, we employ a Variational Graph Auto-Encoder (VGAE) to model and predict subsequent user intent. Our approach achieved state-of-the-art with a JGA of 42.57% outperforming existing ontology-less DST models, and performed well in open-domain real-world conversations. This work presents a significant advancement in creating more adaptive and accurate goal-oriented chatbots.

* There are 10 chapters, including references, and 2 figures used. To be presented at the 15th IEEE International Conference on Knowledge Graphs (ICKG2024)

Via

Access Paper or Ask Questions

Illustrious: an Open Advanced Illustration Model

Sep 30, 2024

Sang Hyun Park, Jun Young Koh, Junha Lee, Joy Song, Dongha Kim, Hoyeon Moon, Hyunju Lee, Min Song

Figure 1 for Illustrious: an Open Advanced Illustration Model

Figure 2 for Illustrious: an Open Advanced Illustration Model

Figure 3 for Illustrious: an Open Advanced Illustration Model

Figure 4 for Illustrious: an Open Advanced Illustration Model

Abstract:In this work, we share the insights for achieving state-of-the-art quality in our text-to-image anime image generative model, called Illustrious. To achieve high resolution, dynamic color range images, and high restoration ability, we focus on three critical approaches for model improvement. First, we delve into the significance of the batch size and dropout control, which enables faster learning of controllable token based concept activations. Second, we increase the training resolution of images, affecting the accurate depiction of character anatomy in much higher resolution, extending its generation capability over 20MP with proper methods. Finally, we propose the refined multi-level captions, covering all tags and various natural language captions as a critical factor for model development. Through extensive analysis and experiments, Illustrious demonstrates state-of-the-art performance in terms of animation style, outperforming widely-used models in illustration domains, propelling easier customization and personalization with nature of open source. We plan to publicly release updated Illustrious model series sequentially as well as sustainable plans for improvements.

Via

Access Paper or Ask Questions

CAT: Contrastive Adapter Training for Personalized Image Generation

Apr 11, 2024

Jae Wan Park, Sang Hyun Park, Jun Young Koh, Junha Lee, Min Song

Figure 1 for CAT: Contrastive Adapter Training for Personalized Image Generation

Figure 2 for CAT: Contrastive Adapter Training for Personalized Image Generation

Figure 3 for CAT: Contrastive Adapter Training for Personalized Image Generation

Figure 4 for CAT: Contrastive Adapter Training for Personalized Image Generation

Abstract:The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to the various challenges including limited datasets and shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the corruption of the backbone model's prior knowledge. One of the well known phenomena is the loss of diversity in object generation, especially within the same class which leads to generating almost identical objects with minor variations. This poses challenges in generation capabilities. To solve this issue, we present Contrastive Adapter Training (CAT), a simple yet effective strategy to enhance adapter training through the application of CAT loss. Our approach facilitates the preservation of the base model's original knowledge when the model initiates adapters. Furthermore, we introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to keep the former information. We qualitatively and quantitatively compare CAT's improvement. Finally, we mention the possibility of CAT in the aspects of multi-concept adapter and optimization.

* CVPRW 2024

Via

Access Paper or Ask Questions

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Jan 16, 2024

Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song

Figure 1 for TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Figure 2 for TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Figure 3 for TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Figure 4 for TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Abstract:Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.

* 13 pages, 7 figures

Via

Access Paper or Ask Questions

Topic Segmentation Model Focusing on Local Context

Jan 05, 2023

Jeonghwan Lee, Jiyeong Han, Sunghoon Baek, Min Song

Figure 1 for Topic Segmentation Model Focusing on Local Context

Figure 2 for Topic Segmentation Model Focusing on Local Context

Figure 3 for Topic Segmentation Model Focusing on Local Context

Figure 4 for Topic Segmentation Model Focusing on Local Context

Abstract:Topic segmentation is important in understanding scientific documents since it can not only provide better readability but also facilitate downstream tasks such as information retrieval and question answering by creating appropriate sections or paragraphs. In the topic segmentation task, topic coherence is critical in predicting segmentation boundaries. Most of the existing models have tried to exploit as many contexts as possible to extract useful topic-related information. However, additional context does not always bring promising results, because the local context between sentences becomes incoherent despite more sentences being supplemented. To alleviate this issue, we propose siamese sentence embedding layers which process two input sentences independently to get appropriate amount of information without being hampered by excessive information. Also, we adopt multi-task learning techniques including Same Topic Prediction (STP), Topic Classification (TC) and Next Sentence Prediction (NSP). When these three classification layers are combined in a multi-task manner, they can make up for each other's limitations, improving performance in all three tasks. We experiment different combinations of the three layers and report how each layer affects other layers in the same combination as well as the overall segmentation performance. The model we proposed achieves the state-of-the-art result in the WikiSection dataset.

* Accepted to AAAI23-SDU23 Workshop Program

Via

Access Paper or Ask Questions

Over-the-Air Federated Learning with Enhanced Privacy

Dec 22, 2022

Xiaochan Xue, Moh Khalid Hasan, Shucheng Yu, Laxima Niure Kandel, Min Song

Figure 1 for Over-the-Air Federated Learning with Enhanced Privacy

Figure 2 for Over-the-Air Federated Learning with Enhanced Privacy

Figure 3 for Over-the-Air Federated Learning with Enhanced Privacy

Figure 4 for Over-the-Air Federated Learning with Enhanced Privacy

Abstract:Federated learning (FL) has emerged as a promising learning paradigm in which only local model parameters (gradients) are shared. Private user data never leaves the local devices thus preserving data privacy. However, recent research has shown that even when local data is never shared by a user, exchanging model parameters without protection can also leak private information. Moreover, in wireless systems, the frequent transmission of model parameters can cause tremendous bandwidth consumption and network congestion when the model is large. To address this problem, we propose a new FL framework with efficient over-the-air parameter aggregation and strong privacy protection of both user data and models. We achieve this by introducing pairwise cancellable random artificial noises (PCR-ANs) on end devices. As compared to existing over-the-air computation (AirComp) based FL schemes, our design provides stronger privacy protection. We analytically show the secrecy capacity and the convergence rate of the proposed wireless FL aggregation algorithm.

* 6 pages

Via

Access Paper or Ask Questions