Jae Won Cho

Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding

Oct 17, 2024

Jongbhin Woo, Hyeonggon Ryu, Youngjoon Jang, Jae Won Cho, Joon Son Chung

Abstract: Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these approaches overlook a crucial aspect of the problem: a holistic understanding of the query sentence. A model may capture correlations between individual word tokens and arbitrary visual frames while possibly missing out on the global meaning. To address this, we introduce two primary contributions: (1) a visual frame-level gate mechanism that incorporates holistic textual information, and (2) a cross-modal alignment loss to learn the fine-grained correlation between the query and relevant frames. As a result, we regularize the effect of individual word tokens and suppress irrelevant visual frames. We demonstrate that our method outperforms state-of-the-art approaches in VTG benchmarks, indicating that holistic text understanding guides the model to focus on the semantically important parts within the video.

* Accepted by ACMMM 24
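
The frame-level gate described in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the dot-product scoring, and the sigmoid gate are all assumptions chosen only to show how a single sentence-level embedding could scale each frame feature.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_frames(frame_features, sentence_embedding):
    """Scale every frame by a gate computed from the whole-sentence embedding.

    Hypothetical illustration: the gate is sigmoid(<frame, sentence>), so
    frames that align with the sentence keep most of their magnitude and
    unrelated frames are damped.
    """
    gated = []
    for frame in frame_features:
        score = sum(f * s for f, s in zip(frame, sentence_embedding))
        gate = sigmoid(score)
        gated.append([gate * f for f in frame])
    return gated


frames = [[0.0, 1.0], [10.0, 0.0]]   # two toy frame features
sentence = [1.0, 0.0]                # toy sentence-level embedding
gated = gate_frames(frames, sentence)
```

Here the first frame is orthogonal to the sentence embedding and is halved, while the second frame aligns with it and passes through almost unchanged.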

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Oct 07, 2024

Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim

Abstract: In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.

* EMNLP 2024 (Long, Main). Project page: https://ytaek-oh.github.io/fsc-clip

NICE: CVPR 2023 Challenge on Zero-shot Image Captioning

Sep 11, 2023

Taehoon Kim, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Mark Marsden, Alessandra Sala, Seung Hwan Kim, Bohyung Han, Kyoung Mu Lee, Honglak Lee(+32 more)

Abstract: In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of the 2023 challenge. This project is designed to challenge the computer vision community to develop robust image captioning models that advance the state of the art in terms of both accuracy and fairness. Through the challenge, the image captioning models were tested using a new evaluation dataset that includes a large variety of visual concepts from many domains. There was no specific training data provided for the challenge, and therefore the challenge entries were required to adapt to new types of image descriptions that had not been seen during training. This report includes information on the newly proposed NICE dataset, evaluation methods, challenge results, and technical details of top-ranking entries. We expect that the outcomes of the challenge will contribute to the improvement of AI models on various vision-language tasks.

* Tech report, project page https://nice.lgresearch.ai/

Self-Sufficient Framework for Continuous Sign Language Recognition

Mar 21, 2023

Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Myungchul Kim, Dong-Jin Kim, In So Kweon, Joon Son Chung

Abstract: The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition. These include the need for complex multi-scale features such as hands, face, and mouth for understanding, and the absence of frame-level annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv), which extracts both manual and non-manual features without the need for additional networks or annotations, and (2) Dense Pseudo-Label Refinement (DPLR), which propagates non-spiky frame-level pseudo-labels by combining the ground truth gloss sequence labels with the predicted sequence. We demonstrate that our model achieves state-of-the-art performance among RGB-based methods on large-scale CSLR benchmarks, PHOENIX-2014 and PHOENIX-2014-T, while showing comparable results with better efficiency when compared to other approaches that use multi-modality or extra annotations.

Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition

Nov 01, 2022

Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, Joon Son Chung, In So Kweon

Abstract: The goal of this work is background-robust continuous sign language recognition. Most existing Continuous Sign Language Recognition (CSLR) benchmarks have fixed backgrounds and are filmed in studios with a static monochromatic background. However, signing is not limited only to studios in the real world. In order to analyze the robustness of CSLR models under background shifts, we first evaluate existing state-of-the-art CSLR models on diverse backgrounds. To synthesize the sign videos with a variety of backgrounds, we propose a pipeline to automatically generate a benchmark dataset utilizing existing CSLR benchmarks. Our newly constructed benchmark dataset consists of diverse scenes to simulate a real-world environment. We observe that even the most recent CSLR method cannot recognize glosses well on our new dataset with changed backgrounds. In this regard, we also propose a simple yet effective training scheme including (1) background randomization and (2) feature disentanglement for CSLR models. The experimental results on our dataset demonstrate that our method generalizes well to other unseen background data with minimal additional training images.

* Our dataset is available at https://github.com/art-jang/Signing-Outside-the-Studio

Generative Bias for Visual Question Answering

Aug 02, 2022

Jae Won Cho, Dong-jin Kim, Hyeonggon Ryu, In So Kweon

Abstract: The task of Visual Question Answering (VQA) is known to be plagued by the issue of VQA models exploiting biases within the dataset to make their final predictions. Many previous ensemble-based debiasing methods have been proposed where an additional model is purposefully trained to be biased in order to aid in training a robust target model. However, these methods compute the bias for a model from the label statistics of the training data or directly from single-modal branches. In contrast, in this work, in order to better learn the bias a target VQA model suffers from, we propose a generative method to train the bias model directly from the target model, called GenB. In particular, GenB employs a generative network to learn the bias through a combination of the adversarial objective and knowledge distillation. We then debias our target model with GenB as a bias model, and show through extensive experiments the effects of our method on various VQA bias datasets including VQA-CP2, VQA-CP1, GQA-OOD, and VQA-CE.

* 10 pages, Bronze Prize, 28th HumanTech Paper Award, Samsung Electronics

Investigating Top-$k$ White-Box and Transferable Black-box Attack

Mar 30, 2022

Chaoning Zhang, Philipp Benz, Adil Karjauv, Jae Won Cho, Kang Zhang, In So Kweon

Abstract: Existing works have identified the limitation of top-$1$ attack success rate (ASR) as a metric to evaluate the attack strength but exclusively investigated it in the white-box setting, while our work extends it to a more practical black-box setting: transferable attack. It is widely reported that stronger I-FGSM transfers worse than simple FGSM, leading to a popular belief that transferability is at odds with the white-box attack strength. Our work challenges this belief with the empirical finding that a stronger attack actually transfers better for the general top-$k$ ASR indicated by the interest class rank (ICR) after attack. For increasing the attack strength, with an intuitive interpretation of the logit gradient from the geometric perspective, we identify that the weakness of the commonly used losses lies in prioritizing the speed to fool the network instead of maximizing its strength. To this end, we propose a new normalized CE loss that guides the logit to be updated in the direction of implicitly maximizing its rank distance from the ground-truth class. Extensive results in various settings have verified that our proposed new loss is simple yet effective for top-$k$ attack. Code is available at: https://bit.ly/3uCiomP

* Accepted by CVPR2022
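
The abstract above does not spell out how the interest class rank (ICR) or the top-$k$ ASR is computed. The sketch below shows one plausible reading under stated assumptions: the rank is 0-based, the interest class is the ground-truth label, and an untargeted attack counts as a top-$k$ success when that label no longer appears among the $k$ highest logits. The function names are hypothetical.

```python
def interest_class_rank(logits, interest_class):
    """0-based rank of interest_class when logits are sorted in descending order."""
    target = logits[interest_class]
    return sum(1 for value in logits if value > target)


def top_k_attack_success_rate(adv_logits_batch, true_labels, k):
    """Fraction of adversarial examples whose true class fell out of the top k.

    Assumption: a rank >= k after the attack is counted as a success.
    """
    successes = sum(
        1
        for logits, label in zip(adv_logits_batch, true_labels)
        if interest_class_rank(logits, label) >= k
    )
    return successes / len(true_labels)


# Toy logits after an attack; the true label is class 0 for both samples.
batch = [[0.1, 2.0, 1.5, 0.3], [3.0, 1.0, 0.5, 0.2]]
labels = [0, 0]
asr_top1 = top_k_attack_success_rate(batch, labels, k=1)
asr_top4 = top_k_attack_success_rate(batch, labels, k=4)
```

With `k=1` this reduces to the usual top-1 ASR; larger `k` only credits attacks that push the true class further down the ranking.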

Single-Modal Entropy based Active Learning for Visual Question Answering

Nov 18, 2021

Dong-Jin Kim, Jae Won Cho, Jinsoo Choi, Yunjae Jung, In So Kweon

Abstract: Constructing a large-scale labeled dataset in the real world, especially for high-level tasks (e.g., Visual Question Answering), can be expensive and time-consuming. In addition, with the ever-growing amounts of data and architecture complexity, Active Learning has become an important aspect of computer vision research. In this work, we address Active Learning in the multi-modal setting of Visual Question Answering (VQA). In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition through the use of ad hoc single-modal branches for each input to leverage its information. Our mutual-information-based sample acquisition strategy, Single-Modal Entropic Measure (SMEM), in addition to our self-distillation technique, enables the sample acquisitor to exploit all present modalities and find the most informative samples. Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks. We confirm our findings on various VQA datasets through state-of-the-art performance by comparing to existing Active Learning baselines.

* Accepted to BMVC 2021
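
For readers unfamiliar with uncertainty-driven sample acquisition, the following is a generic entropy-ranking sketch. It is a simplified stand-in, not the SMEM measure from the paper (which the abstract describes as mutual-information-based and built on single-modal branches); the function names and the toy probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, budget):
    """Return the ids of the `budget` unlabeled samples with the highest entropy.

    pool_probs maps a sample id to the predicted answer distribution of
    some model branch (which branch is an assumption left to the caller).
    """
    ranked = sorted(
        pool_probs.items(), key=lambda item: entropy(item[1]), reverse=True
    )
    return [sample_id for sample_id, _ in ranked[:budget]]


pool = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.9, 0.1]}
to_label = select_most_uncertain(pool, budget=2)
```

The samples chosen this way would be sent to an annotator, and the model retrained on the enlarged labeled set.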

Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation

Aug 12, 2021

Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So Kweon, Kyung-Soo Kim, Soohyun Kim

Abstract: Volumetric deep learning approaches to stereo matching aggregate a cost volume computed from input left and right images using 3D convolutions. Recent works showed that utilization of extracted image features and a spatially varying cost volume aggregation complements 3D convolutions. However, existing methods with spatially varying operations are complex, cost considerable computation time, and cause memory consumption to increase. In this work, we construct Guided Cost volume Excitation (GCE) and show that simple channel excitation of the cost volume guided by the image can improve performance considerably. Moreover, we propose a novel method of using top-k selection prior to soft-argmin disparity regression for computing the final disparity estimate. Combining our novel contributions, we present an end-to-end network that we call Correlate-and-Excite (CoEx). Extensive experiments of our model on the SceneFlow, KITTI 2012, and KITTI 2015 datasets demonstrate the effectiveness and efficiency of our model and show that our model outperforms other speed-based algorithms while also being competitive with other state-of-the-art algorithms. Code will be made available at https://github.com/antabangun/coex.

* To appear at IROS 2021. Code is available at https://github.com/antabangun/coex
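
The top-k selection before soft-argmin mentioned in the abstract above can be sketched for a single pixel as follows. This is an illustrative reading, not code from the linked repository: it assumes lower cost means a better match, keeps the `k` lowest-cost disparity candidates, and takes the softmax-weighted mean disparity over only those candidates. The function name is hypothetical.

```python
import math


def topk_soft_argmin(costs, k):
    """Estimate a sub-pixel disparity from per-disparity matching costs.

    costs[d] is the aggregated cost at integer disparity d (lower is
    better, by assumption). Only the k cheapest candidates take part in
    the softmax over negative costs.
    """
    candidates = sorted(range(len(costs)), key=lambda d: costs[d])[:k]
    best = min(costs[d] for d in candidates)
    # Subtract the best cost before exponentiating for numerical stability.
    weights = [math.exp(-(costs[d] - best)) for d in candidates]
    total = sum(weights)
    return sum(d * w for d, w in zip(candidates, weights)) / total


disparity_k2 = topk_soft_argmin([9.0, 1.0, 1.0, 9.0], k=2)
disparity_k1 = topk_soft_argmin([3.0, 0.5, 2.0], k=1)
```

Restricting the expectation to the top candidates keeps distant, high-cost disparities from pulling the estimate away from the dominant mode, which is the usual motivation for this kind of truncation.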

LabOR: Labeling Only if Required for Domain Adaptive Semantic Segmentation

Aug 12, 2021

Inkyu Shin, Dong-jin Kim, Jae Won Cho, Sanghyun Woo, Kwanyong Park, In So Kweon

Abstract: Unsupervised Domain Adaptation (UDA) for semantic segmentation has been actively studied to mitigate the domain gap between label-rich source data and unlabeled target data. Despite these efforts, UDA still has a long way to go to reach the fully supervised performance. To this end, we propose a Labeling Only if Required strategy, LabOR, where we introduce a human-in-the-loop approach to adaptively give scarce labels to points that a UDA model is uncertain about. In order to find the uncertain points, we generate an inconsistency mask using the proposed adaptive pixel selector and we label these segment-based regions to achieve near supervised performance with only a small fraction (about 2.2%) of ground truth points, which we call "Segment based Pixel-Labeling (SPL)". To further reduce the efforts of the human annotator, we also propose "Point-based Pixel-Labeling (PPL)", which finds the most representative points for labeling within the generated inconsistency mask. This reduces the labeling effort from 2.2% segment labels to 40 point labels while minimizing performance degradation. Through extensive experimentation, we show the advantages of this new framework for domain adaptive semantic segmentation while minimizing human labor costs.

* Accepted to ICCV 2021 (Oral)
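
The abstract above does not define how the inconsistency mask is built. One simple interpretation, assumed here purely for illustration, is to mark pixels where two per-pixel class predictions for the same image disagree and to request labels only there. The function names and the toy label maps are hypothetical.

```python
def inconsistency_mask(pred_a, pred_b):
    """Mark pixels where two predicted label maps disagree (True = uncertain)."""
    return [
        [a != b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pred_a, pred_b)
    ]


def masked_fraction(mask):
    """Fraction of pixels flagged for annotation."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Two toy 2x2 label maps that differ in a single pixel.
pred_a = [[0, 1], [2, 2]]
pred_b = [[0, 1], [2, 3]]
mask = inconsistency_mask(pred_a, pred_b)
fraction = masked_fraction(mask)
```

The flagged fraction plays the role of the annotation budget: the smaller it is, the fewer pixels a human needs to label.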
