Boseung Jeong

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

Apr 03, 2025

Boseung Jeong, Jicheol Park, Sungyeon Kim, Suha Kwak

Abstract: Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.

* Accepted to CVPR 2025
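
The gating idea in the abstract above can be illustrated with a minimal sketch. This is not the AVIGATE implementation: the function `gated_fusion`, the scalar gate, and the parameters `gate_weights` and `gate_bias` are hypothetical names chosen for illustration, and the abstract does not specify how the gate is actually computed. The sketch only shows how a gate value near 0 leaves the video feature almost untouched, while a gate near 1 lets the audio context through.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(video, audio_ctx, gate_weights, gate_bias):
    """Add audio context to a video feature, scaled by a scalar gate.

    Assumption (not from the paper): the gate is a sigmoid over a linear
    score of the concatenated video and audio vectors.
    """
    concat = list(video) + list(audio_ctx)
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(score)
    # Residual form: with gate close to 0 the output stays close to `video`.
    fused = [v + gate * a for v, a in zip(video, audio_ctx)]
    return fused, gate


video = [1.0, 2.0]
audio_ctx = [0.5, -0.5]
weights = [0.0, 0.0, 0.0, 0.0]

# A strongly negative bias closes the gate, so the audio is filtered out.
closed, g_closed = gated_fusion(video, audio_ctx, weights, -50.0)
# A strongly positive bias opens the gate, so the audio is added in full.
opened, g_open = gated_fusion(video, audio_ctx, weights, 50.0)
```

In a trained model the gate parameters would be learned, so that uninformative audio yields a low gate value without any hand-set bias.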

Improving Text-based Person Search via Part-level Cross-modal Correspondence

Dec 31, 2024

Jicheol Park, Boseung Jeong, Dongwon Kim, Suha Kwak

Abstract: Text-based person search is the task of finding person images that are the most relevant to the natural language text description given as a query. The main challenge of this task is a large gap between the target images and text queries, which makes it difficult to establish correspondence and distinguish subtle differences across people. To address this challenge, we introduce an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors which are semantically aligned across the two modalities without supervision for the alignment. Another challenge is learning to capture fine-grained information with only person IDs as supervision, where similar body parts of different individuals are considered different due to the lack of part-level supervision. To tackle this, we propose a novel ranking loss, dubbed commonality-based margin ranking loss, which quantifies the degree of commonality of each body part and reflects it during the learning of fine-grained body part details. As a consequence, it enables our method to achieve the best records on three public benchmarks.

Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

Aug 11, 2024

Sungyeon Kim, Boseung Jeong, Donghyun Kim, Suha Kwak

Abstract: Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which reduces generalization to out-of-distribution (OOD) data and demands extensive computational resources. We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks while simultaneously addressing both of these issues. Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially. Furthermore, we propose the MPM-NCE loss, designed for fine-tuning on vision-language downstream tasks. It ensures precise alignment of multiple image-text pairs and discriminative feature learning. By extending the benchmark for robust fine-tuning beyond classification to include diverse tasks such as cross-modal retrieval and open-vocabulary segmentation, we demonstrate the broad applicability of R-Adapter. Our extensive experiments demonstrate that R-Adapter achieves state-of-the-art performance across a diverse set of tasks, tuning only 13% of the parameters of the CLIP encoders.

* Accepted to ECCV 2024

Human Pose Estimation in Extremely Low-Light Conditions

Mar 27, 2023

Sohyun Lee, Jaesung Rim, Boseung Jeong, Geonu Kim, Byungju Woo, Haechan Lee, Sunghyun Cho, Suha Kwak

Abstract: We study human pose estimation in extremely low-light images. This task is challenging due to the difficulty of collecting real low-light images with accurate labels, and severely corrupted inputs that degrade prediction quality significantly. To address the first issue, we develop a dedicated camera system and build a new dataset of real low-light images with accurate pose labels. Thanks to our camera system, each low-light image in our dataset is coupled with an aligned well-lit image, which enables accurate pose labeling and is used as privileged information during training. We also propose a new model and a new training strategy that fully exploit the privileged information to learn representations insensitive to lighting conditions. Our method demonstrates outstanding performance on real extremely low-light images, and extensive analyses validate that both our model and dataset contribute to the success.

* Accepted to CVPR 2023

ASMR: Learning Attribute-Based Person Search with Adaptive Semantic Margin Regularizer

Aug 10, 2021

Boseung Jeong, Jicheol Park, Suha Kwak

Abstract: Attribute-based person search is the task of finding person images that are best matched with a set of text attributes given as a query. The main challenge of this task is the large modality gap between attributes and images. To reduce the gap, we present a new loss for learning cross-modal embeddings in the context of attribute-based person search. We regard a set of attributes as a category of people sharing the same traits. In a joint embedding space of the two modalities, our loss pulls images close to their person categories for modality alignment. More importantly, it pushes apart a pair of person categories by a margin determined adaptively by their semantic distance, where the distance metric is learned end-to-end so that the loss considers the importance of each attribute when relating person categories. Our loss guided by the adaptive semantic margin leads to more discriminative and semantically well-arranged distributions of person images. As a consequence, it enables a simple embedding model to achieve state-of-the-art records on public benchmarks without bells and whistles.

* Accepted to ICCV 2021
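
The adaptive-margin idea described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the ASMR loss itself: the function names, the weighted attribute-mismatch distance, and the hinge form of the penalty are all hypothetical, and the fixed `weights` list stands in for the distance metric that the paper learns end-to-end. The sketch shows only the pushing term between two person categories, where the required separation grows with how semantically different their attribute sets are.

```python
import math


def semantic_distance(attrs_a, attrs_b, weights):
    """Weighted count of attributes on which two categories differ.

    Assumption: `weights` is a fixed stand-in for a learned metric.
    """
    return sum(w for a, b, w in zip(attrs_a, attrs_b, weights) if a != b)


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def adaptive_margin_penalty(emb_a, emb_b, attrs_a, attrs_b, weights):
    """Hinge penalty that is zero once two category embeddings are at
    least as far apart as their semantic distance."""
    margin = semantic_distance(attrs_a, attrs_b, weights)
    return max(0.0, margin - euclidean(emb_a, emb_b))


emb_a, emb_b = [0.0, 0.0], [3.0, 4.0]  # embedding distance is 5
weights = [2.0, 4.0, 1.0]

# Categories differing in one low-weight attribute need a small margin (2),
# which the embeddings already satisfy.
similar = adaptive_margin_penalty(emb_a, emb_b, [1, 0, 1], [0, 0, 1], weights)

# Categories differing in every attribute need a larger margin (7),
# so the same embedding distance is penalized.
dissimilar = adaptive_margin_penalty(emb_a, emb_b, [1, 0, 1], [0, 1, 0], weights)
```

A full loss would also include the pulling term that draws images toward their own category embedding, which this sketch leaves out.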
