Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sangheum Hwang

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

Mar 24, 2025

Jeonghyeon Kim, Sangheum Hwang

Abstract:Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of na\"ive fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.

* CVPR 2025

Via

Access Paper or Ask Questions

Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation

Oct 19, 2024

Seulbi Lee, Jihyo Kim, Sangheum Hwang

Abstract:With the recent emergence of foundation models trained on internet-scale data and demonstrating remarkable generalization capabilities, such foundation models have become more widely adopted, leading to an expanding range of application domains. Despite this rapid proliferation, the trustworthiness of foundation models remains underexplored. Specifically, the out-of-distribution detection (OoDD) capabilities of large vision-language models (LVLMs), such as GPT-4o, which are trained on massive multi-modal data, have not been sufficiently addressed. The disparity between their demonstrated potential and practical reliability raises concerns regarding the safe and trustworthy deployment of foundation models. To address this gap, we evaluate and analyze the OoDD capabilities of various proprietary and open-source LVLMs. Our investigation contributes to a better understanding of how these foundation models represent confidence scores through their generated natural language responses. Based on our observations, we propose a self-guided prompting approach, termed \emph{Reflexive Guidance (ReGuide)}, aimed at enhancing the OoDD capability of LVLMs by leveraging self-generated image-adaptive concept suggestions. Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks.

* The first two authors contributed equally

Via

Access Paper or Ask Questions

StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention

Feb 25, 2024

Seungwon Seo, Suho Lee, Sangheum Hwang

Figure 1 for StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention

Figure 2 for StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention

Figure 3 for StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention

Figure 4 for StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention

Abstract:Utilizing large-scale pretrained models is a well-known strategy to enhance performance on various target tasks. It is typically achieved through fine-tuning pretrained models on target tasks. However, na\"{\i}ve fine-tuning may not fully leverage knowledge embedded in pretrained models. In this study, we introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures. This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning. Specifically, in each block, instead of self-attention, cross-attention is performed stochastically according to the predefined probability, where keys and values are extracted from the corresponding block of a pretrained model. By doing so, queries and channel-mixing multi-layer perceptron layers of a target model are fine-tuned to target tasks to learn how to effectively exploit rich representations of pretrained models. To verify the effectiveness of StochCA, extensive experiments are conducted on benchmarks in the areas of transfer learning and domain generalization, where the exploitation of pretrained models is critical. Our experimental results show the superiority of StochCA over state-of-the-art approaches in both areas. Furthermore, we demonstrate that StochCA is complementary to existing approaches, i.e., it can be combined with them to further improve performance. Our code is available at https://github.com/daintlab/stochastic_cross_attention

* The first two authors contributed equally

Via

Access Paper or Ask Questions

GTA: Guided Transfer of Spatial Attention from Object-Centric Representations

Jan 05, 2024

SeokHyun Seo, Jinwoo Hong, JungWoo Chae, Kyungyul Kim, Sangheum Hwang

Abstract:Utilizing well-trained representations in transfer learning often results in superior performance and faster convergence compared to training from scratch. However, even if such good representations are transferred, a model can easily overfit the limited training dataset and lose the valuable properties of the transferred representations. This phenomenon is more severe in ViT due to its low inductive bias. Through experimental analysis using attention maps in ViT, we observe that the rich representations deteriorate when trained on a small dataset. Motivated by this finding, we propose a novel and simple regularization method for ViT called Guided Transfer of spatial Attention (GTA). Our proposed method regularizes the self-attention maps between the source and target models. A target model can fully exploit the knowledge related to object localization properties through this explicit regularization. Our experimental results show that the proposed GTA consistently improves the accuracy across five benchmark datasets especially when the number of training data is small.

Via

Access Paper or Ask Questions

Few-shot Fine-tuning is All You Need for Source-free Domain Adaptation

Apr 24, 2023

Suho Lee, Seungwon Seo, Jihyo Kim, Yejin Lee, Sangheum Hwang

Abstract:Recently, source-free unsupervised domain adaptation (SFUDA) has emerged as a more practical and feasible approach compared to unsupervised domain adaptation (UDA) which assumes that labeled source data are always accessible. However, significant limitations associated with SFUDA approaches are often overlooked, which limits their practicality in real-world applications. These limitations include a lack of principled ways to determine optimal hyperparameters and performance degradation when the unlabeled target data fail to meet certain requirements such as a closed-set and identical label distribution to the source data. All these limitations stem from the fact that SFUDA entirely relies on unlabeled target data. We empirically demonstrate the limitations of existing SFUDA methods in real-world scenarios including out-of-distribution and label distribution shifts in target data, and verify that none of these methods can be safely applied to real-world settings. Based on our experimental results, we claim that fine-tuning a source pretrained model with a few labeled data (e.g., 1- or 3-shot) is a practical and reliable solution to circumvent the limitations of SFUDA. Contrary to common belief, we find that carefully fine-tuned models do not suffer from overfitting even when trained with only a few labeled data, and also show little change in performance due to sampling bias. Our experimental results on various domain adaptation benchmarks demonstrate that the few-shot fine-tuning approach performs comparatively under the standard SFUDA settings, and outperforms comparison methods under realistic scenarios. Our code is available at https://github.com/daintlab/fewshot-SFDA .

* The first two authors contributed equally

Via

Access Paper or Ask Questions

Rethinking Evaluation Protocols of Visual Representations Learned via Self-supervised Learning

Apr 07, 2023

Jae-Hun Lee, Doyoung Yoon, ByeongMoon Ji, Kyungyul Kim, Sangheum Hwang

Figure 1 for Rethinking Evaluation Protocols of Visual Representations Learned via Self-supervised Learning

Figure 2 for Rethinking Evaluation Protocols of Visual Representations Learned via Self-supervised Learning

Figure 3 for Rethinking Evaluation Protocols of Visual Representations Learned via Self-supervised Learning

Figure 4 for Rethinking Evaluation Protocols of Visual Representations Learned via Self-supervised Learning

Abstract:Linear probing (LP) (and $k$-NN) on the upstream dataset with labels (e.g., ImageNet) and transfer learning (TL) to various downstream datasets are commonly employed to evaluate the quality of visual representations learned via self-supervised learning (SSL). Although existing SSL methods have shown good performances under those evaluation protocols, we observe that the performances are very sensitive to the hyperparameters involved in LP and TL. We argue that this is an undesirable behavior since truly generic representations should be easily adapted to any other visual recognition task, i.e., the learned representations should be robust to the settings of LP and TL hyperparameters. In this work, we try to figure out the cause of performance sensitivity by conducting extensive experiments with state-of-the-art SSL methods. First, we find that input normalization for LP is crucial to eliminate performance variations according to the hyperparameters. Specifically, batch normalization before feeding inputs to a linear classifier considerably improves the stability of evaluation, and also resolves inconsistency of $k$-NN and LP metrics. Second, for TL, we demonstrate that a weight decay parameter in SSL significantly affects the transferability of learned representations, which cannot be identified by LP or $k$-NN evaluations on the upstream dataset. We believe that the findings of this study will be beneficial for the community by drawing attention to the shortcomings in the current SSL evaluation schemes and underscoring the need to reconsider them.

Via

Access Paper or Ask Questions

Deep Active Learning with Contrastive Learning Under Realistic Data Pool Assumptions

Mar 25, 2023

Jihyo Kim, Jeonghyeon Kim, Sangheum Hwang

Figure 1 for Deep Active Learning with Contrastive Learning Under Realistic Data Pool Assumptions

Figure 2 for Deep Active Learning with Contrastive Learning Under Realistic Data Pool Assumptions

Figure 3 for Deep Active Learning with Contrastive Learning Under Realistic Data Pool Assumptions

Figure 4 for Deep Active Learning with Contrastive Learning Under Realistic Data Pool Assumptions

Abstract:Active learning aims to identify the most informative data from an unlabeled data pool that enables a model to reach the desired accuracy rapidly. This benefits especially deep neural networks which generally require a huge number of labeled samples to achieve high performance. Most existing active learning methods have been evaluated in an ideal setting where only samples relevant to the target task, i.e., in-distribution samples, exist in an unlabeled data pool. A data pool gathered from the wild, however, is likely to include samples that are irrelevant to the target task at all and/or too ambiguous to assign a single class label even for the oracle. We argue that assuming an unlabeled data pool consisting of samples from various distributions is more realistic. In this work, we introduce new active learning benchmarks that include ambiguous, task-irrelevant out-of-distribution as well as in-distribution samples. We also propose an active learning method designed to acquire informative in-distribution samples in priority. The proposed method leverages both labeled and unlabeled data pools and selects samples from clusters on the feature space constructed via contrastive learning. Experimental results demonstrate that the proposed method requires a lower annotation budget than existing active learning methods to reach the same level of accuracy.

* AAAI 2023 Workshop on Practical Deep Learning in the Wild

Via

Access Paper or Ask Questions

A Unified Benchmark for the Unknown Detection Capability of Deep Neural Networks

Dec 01, 2021

Jihyo Kim, Jiin Koo, Sangheum Hwang

Figure 1 for A Unified Benchmark for the Unknown Detection Capability of Deep Neural Networks

Figure 2 for A Unified Benchmark for the Unknown Detection Capability of Deep Neural Networks

Figure 3 for A Unified Benchmark for the Unknown Detection Capability of Deep Neural Networks

Figure 4 for A Unified Benchmark for the Unknown Detection Capability of Deep Neural Networks

Abstract:Deep neural networks have achieved outstanding performance over various tasks, but they have a critical issue: over-confident predictions even for completely unknown samples. Many studies have been proposed to successfully filter out these unknown samples, but they only considered narrow and specific tasks, referred to as misclassification detection, open-set recognition, or out-of-distribution detection. In this work, we argue that these tasks should be treated as fundamentally an identical problem because an ideal model should possess detection capability for all those tasks. Therefore, we introduce the unknown detection task, an integration of previous individual tasks, for a rigorous examination of the detection capability of deep neural networks on a wide spectrum of unknown samples. To this end, unified benchmark datasets on different scales were constructed and the unknown detection capabilities of existing popular methods were subject to comparison. We found that Deep Ensemble consistently outperforms the other approaches in detecting unknowns; however, all methods are only successful for a specific type of unknown. The reproducible code and benchmark datasets are available at https://github.com/daintlab/unknown-detection-benchmarks .

* Under review

Via

Access Paper or Ask Questions

Elucidating Noisy Data via Uncertainty-Aware Robust Learning

Nov 02, 2021

Jeongeun Park, Seungyoun Shin, Sangheum Hwang, Sungjoon Choi

Figure 1 for Elucidating Noisy Data via Uncertainty-Aware Robust Learning

Figure 2 for Elucidating Noisy Data via Uncertainty-Aware Robust Learning

Figure 3 for Elucidating Noisy Data via Uncertainty-Aware Robust Learning

Figure 4 for Elucidating Noisy Data via Uncertainty-Aware Robust Learning

Abstract:Robust learning methods aim to learn a clean target distribution from noisy and corrupted training data where a specific corruption pattern is often assumed a priori. Our proposed method can not only successfully learn the clean target distribution from a dirty dataset but also can estimate the underlying noise pattern. To this end, we leverage a mixture-of-experts model that can distinguish two different types of predictive uncertainty, aleatoric and epistemic uncertainty. We show that the ability to estimate the uncertainty plays a significant role in elucidating the corruption patterns as these two objectives are tightly intertwined. We also present a novel validation scheme for evaluating the performance of the corruption pattern estimation. Our proposed method is extensively assessed in terms of both robustness and corruption pattern estimation through a number of domains, including computer vision and natural language processing.

Via

Access Paper or Ask Questions

Confidence-Aware Learning for Deep Neural Networks

Jul 07, 2020

Jooyoung Moon, Jihyo Kim, Younghak Shin, Sangheum Hwang

Figure 1 for Confidence-Aware Learning for Deep Neural Networks

Figure 2 for Confidence-Aware Learning for Deep Neural Networks

Figure 3 for Confidence-Aware Learning for Deep Neural Networks

Figure 4 for Confidence-Aware Learning for Deep Neural Networks

Abstract:Despite the power of deep neural networks for a wide range of tasks, an overconfident prediction issue has limited their practical use in many safety-critical applications. Many recent works have been proposed to mitigate this issue, but most of them require either additional computational costs in training and/or inference phases or customized architectures to output confidence estimates separately. In this paper, we propose a method of training deep neural networks with a novel loss function, named Correctness Ranking Loss, which regularizes class probabilities explicitly to be better confidence estimates in terms of ordinal ranking according to confidence. The proposed method is easy to implement and can be applied to the existing architectures without any modification. Also, it has almost the same computational costs for training as conventional deep classifiers and outputs reliable predictions by a single inference. Extensive experimental results on classification benchmark datasets indicate that the proposed method helps networks to produce well-ranked confidence estimates. We also demonstrate that it is effective for the tasks closely related to confidence estimation, out-of-distribution detection and active learning.

* ICML 2020. The first two authors contributed equally

Via

Access Paper or Ask Questions