Junyeong Kim

Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning

Apr 30, 2025

Sangyeon Cho, Jangyeong Jeon, Mingi Kim, Junyeong Kim

Abstract:Multi-modal representation learning has become a pivotal area in artificial intelligence, enabling the integration of diverse modalities such as vision, text, and audio to solve complex problems. However, existing approaches predominantly focus on bimodal interactions, such as image-text pairs, which limits their ability to fully exploit the richness of multi-modal data. Furthermore, the integration of modalities in equal-scale environments remains underexplored due to the challenges of constructing large-scale, balanced datasets. In this study, we propose Synergy-CLIP, a novel framework that extends the contrastive language-image pre-training (CLIP) architecture to enhance multi-modal representation learning by integrating visual, textual, and audio modalities. Unlike existing methods that focus on adapting individual modalities to vanilla-CLIP, Synergy-CLIP aligns and captures latent information across three modalities equally. To address the high cost of constructing large-scale multi-modal datasets, we introduce VGG-sound+, a triple-modal dataset designed to provide equal-scale representation of visual, textual, and audio data. Synergy-CLIP is validated on various downstream tasks, including zero-shot classification, where it outperforms existing baselines. Additionally, we introduce a missing modality reconstruction task, demonstrating Synergy-CLIP's ability to extract synergy among modalities in realistic application scenarios. These contributions provide a robust foundation for advancing multi-modal representation learning and exploring new research directions.
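
The abstract does not spell out the training objective, so the following is only a minimal sketch of one way to align three modalities equally: a symmetric CLIP-style contrastive (InfoNCE) loss averaged over the image-text, image-audio, and text-audio pairs. The function names, the temperature value, and the equal weighting of the three pairs are assumptions made for illustration, not details taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for a batch where a[i] and b[i] are a matched pair."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    n = len(a)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


def tri_modal_loss(image, text, audio, temperature=0.07):
    """Average the pairwise losses so no modality is treated as the anchor (assumption)."""
    return (
        info_nce(image, text, temperature)
        + info_nce(image, audio, temperature)
        + info_nce(text, audio, temperature)
    ) / 3.0


# Toy batch of two samples with 2-dimensional embeddings per modality.
image = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
audio = [[0.8, 0.2], [0.2, 0.8]]
print(tri_modal_loss(image, text, audio))
```

In this sketch the loss is small when matching samples point in the same direction across all three modalities and grows when any one modality is misaligned with the other two.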

* Multi-modal, Multi-modal Representation Learning, Missing Modality, Missing Modality Reconstruction, Speech and Multi-modality, Vision and Language

BioBridge: Unified Bio-Embedding with Bridging Modality in Code-Switched EMR

Dec 16, 2024

Jangyeong Jeon, Sangyeon Cho, Dongjoon Lee, Changhee Lee, Junyeong Kim

Abstract:Pediatric Emergency Department (PED) overcrowding presents a significant global challenge, prompting the need for efficient solutions. This paper introduces the BioBridge framework, a novel approach that applies Natural Language Processing (NLP) to Electronic Medical Records (EMRs) in written free-text form to enhance decision-making in the PED. In non-English speaking countries, such as South Korea, EMR data is often written in a Code-Switching (CS) format that mixes the native language with English, with most code-switched English words having clinical significance. The BioBridge framework consists of two core modules: "bridging modality in context" and "unified bio-embedding." The "bridging modality in context" module improves the contextual understanding of bilingual and code-switched EMRs. In the "unified bio-embedding" module, the knowledge of a model trained in the medical domain is injected into the encoder-based model to bridge the gap between the medical and general domains. Experimental results demonstrate that the proposed BioBridge significantly outperforms traditional machine learning and pre-trained encoder-based models on several metrics, including F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Brier score. Specifically, BioBridge-XLM achieved enhancements of 0.85% in F1 score, 0.75% in AUROC, and 0.76% in AUPRC, along with a notable 3.04% decrease in the Brier score, demonstrating marked improvements in accuracy, reliability, and prediction calibration over the baseline XLM model. The source code will be made publicly available.
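
Because the abstract reports a decrease in the Brier score as an improvement, a short sketch of that metric may help: for binary outcomes it is the mean squared difference between the predicted probability and the observed label, so lower is better. The function name and the toy numbers below are illustrative only and are not taken from the paper.

```python
def brier_score(probabilities, labels):
    """Mean squared error between predicted probabilities and binary outcomes (0 or 1)."""
    if not probabilities or len(probabilities) != len(labels):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Hypothetical predictions for three cases; labels are the observed outcomes.
labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]
hedging = [0.5, 0.5, 0.5]

# The confident, well-calibrated predictions score lower (better) than the hedging ones.
print(brier_score(confident, labels), brier_score(hedging, labels))
```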

* IEEE Access 2024

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

Aug 16, 2024

Tri Ton, Ji Woo Hong, SooHwan Eom, Jun Yeop Shim, Junyeong Kim, Chang D. Yoo

Abstract:Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods by enabling the identification of both previously seen and unseen objects in real-world scenarios. It leverages a dual-modality approach, utilizing both 3D point clouds and 2D multi-view images to generate class-agnostic object mask proposals. Previous efforts predominantly focused on enhancing 3D mask proposal models; consequently, the information that could come from 2D association to 3D was not fully exploited. This bias towards 3D data, while effective for familiar indoor objects, limits the system's adaptability to new and varied object types, where 2D models offer greater utility. Addressing this gap, we introduce the Zero-Shot Dual-Path Integration Framework, which equally values the contributions of both 3D and 2D modalities. Our framework comprises three components: a 3D pathway, a 2D pathway, and Dual-Path Integration. The 3D pathway generates spatially accurate class-agnostic mask proposals of common indoor objects from 3D point cloud data using a pre-trained 3D model, while the 2D pathway utilizes a pre-trained open-vocabulary instance segmentation model to identify a diverse array of object proposals from multi-view RGB-D images. In Dual-Path Integration, our Conditional Integration process, which operates in two stages, filters and merges the proposals from both pathways adaptively. This process harmonizes output proposals to enhance segmentation capabilities. Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data, as evidenced by comprehensive evaluations on the ScanNet200 dataset and qualitative results on the ARKitScenes dataset.

* OpenSUN 3D: 2nd Workshop on Open-Vocabulary 3D Scene Understanding (CVPR 2024)

HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue

Dec 15, 2023

Sunjae Yoon, Dahyun Kim, Eunseop Yoon, Hee Suk Yoon, Junyeong Kim, Chang D. Yoo

Abstract:Video-grounded Dialogue (VGD) aims to answer questions regarding a given multi-modal input comprising video, audio, and dialogue history. Although there have been numerous efforts to develop VGD systems that improve the quality of their responses, existing systems can incorporate only the information in the video and text, and tend to struggle to extract the necessary information from the audio when generating appropriate responses to the question. The VGD system seems to be deaf, and thus we coin the term deaf response for this symptom of current systems ignoring audio data. To overcome the deaf response problem, the Hearing Enhanced Audio Response (HEAR) framework is proposed to perform sensible listening by selectively attending to audio whenever the question requires it. The HEAR framework enhances the accuracy and audibility of VGD systems in a model-agnostic manner. HEAR is validated on VGD datasets (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows effectiveness with various VGD systems.

* EMNLP 2023, 14 pages, 13 figures

Information-Theoretic Text Hallucination Reduction for Video-grounded Dialogue

Dec 12, 2022

Sunjae Yoon, Eunseop Yoon, Hee Suk Yoon, Junyeong Kim, Chang D. Yoo

Abstract:Video-grounded Dialogue (VGD) aims to decode an answer sentence to a question regarding a given video and dialogue context. Despite the recent success of multi-modal reasoning to generate answer sentences, existing dialogue systems still suffer from a text hallucination problem, which denotes indiscriminate text-copying from input texts without an understanding of the question. This is due to learning spurious correlations from the fact that answer sentences in the dataset usually include the words of input texts, so the VGD system relies excessively on copying words from input texts in the hope that those words overlap with ground-truth texts. Hence, we design the Text Hallucination Mitigating (THAM) framework, which incorporates a Text Hallucination Regularization (THR) loss derived from the proposed information-theoretic text hallucination measurement approach. Applying THAM to current dialogue systems validates its effectiveness on VGD benchmarks (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows enhanced interpretability.

* 12 pages, Accepted in EMNLP 2022

Selective Query-guided Debiasing Network for Video Corpus Moment Retrieval

Oct 17, 2022

Sunjae Yoon, Ji Woo Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Chang D. Yoo

Abstract:Video moment retrieval (VMR) aims to localize target moments in untrimmed videos pertinent to a given textual query. Existing retrieval systems tend to rely on retrieval bias as a shortcut and thus fail to sufficiently learn multi-modal interactions between query and video. This retrieval bias stems from learning frequent co-occurrence patterns between query and moments, which spuriously correlate objects (e.g., a pencil) referred to in the query with moments (e.g., scene of writing with a pencil) where the objects frequently appear in the video, such that they converge into biased moment predictions. Although recent debiasing methods have focused on removing this retrieval bias, we argue that these biased predictions sometimes should be preserved because there are many queries where biased predictions are rather helpful. To conjugate this retrieval bias, we propose a Selective Query-guided Debiasing network (SQuiDNet), which incorporates the following two main properties: (1) Biased Moment Retrieval that intentionally uncovers the biased moments inherent in objects of the query and (2) Selective Query-guided Debiasing that performs selective debiasing guided by the meaning of the query. Our experimental results on three moment retrieval benchmarks (i.e., TVR, ActivityNet, DiDeMo) show the effectiveness of SQuiDNet, and qualitative analysis shows improved interpretability.

* 16 pages, 6 figures, Accepted in ECCV 2022

SoftGroup++: Scalable 3D Instance Segmentation with Octree Pyramid Grouping

Sep 17, 2022

Thang Vu, Kookhoi Kim, Tung M. Luu, Thanh Nguyen, Junyeong Kim, Chang D. Yoo

Abstract:Existing state-of-the-art 3D point cloud instance segmentation methods rely on a grouping-based approach that groups points to obtain object instances. Despite improvement in producing accurate segmentation results, these methods lack scalability and commonly require dividing large input into multiple parts. To process a scene with millions of points, the existing fastest method SoftGroup \cite{vu2022softgroup} requires tens of seconds, which is unsatisfactory. Our finding is that $k$-Nearest Neighbor ($k$-NN), which serves as the prerequisite of grouping, is a computational bottleneck. This bottleneck severely worsens the inference time in scenes with a large number of points. This paper proposes SoftGroup++ to address this computational bottleneck and further optimize the inference speed of the whole network. SoftGroup++ is built upon SoftGroup and differs from it in three important aspects: it (1) performs octree $k$-NN instead of vanilla $k$-NN to reduce time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$, (2) performs pyramid scaling that adaptively downsamples backbone outputs to reduce the search space for $k$-NN and grouping, and (3) performs late devoxelization that delays the conversion from voxels to points towards the end of the model such that intermediate components operate at a low computational cost. Extensive experiments on various indoor and outdoor datasets demonstrate the efficacy of the proposed SoftGroup++. Notably, SoftGroup++ processes large scenes of millions of points in a single forward pass without dividing the input into multiple parts, thus enriching contextual information. In particular, SoftGroup++ achieves a 2.4-point AP$_{50}$ improvement while being nearly $6\times$ faster than the existing fastest method on the S3DIS dataset. The code and trained models will be made publicly available.
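
To make the bottleneck concrete, here is a minimal sketch of vanilla (brute-force) $k$-NN, the baseline the abstract contrasts with octree $k$-NN. Every point is compared against every other point, which gives the $n(n-1)$ distance evaluations behind the $\mathcal{O}(n^2)$ cost. This is an illustration of the baseline only; it is not the SoftGroup or SoftGroup++ implementation, and the function name is hypothetical.

```python
import heapq


def brute_force_knn(points, k):
    """Return, for each point, the indices of its k nearest other points.

    Performs n * (n - 1) squared-distance evaluations for n points.
    """
    neighbors = []
    for i, p in enumerate(points):
        candidates = []
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbor
            squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
            candidates.append((squared_distance, j))
        # Keep only the k closest candidates, nearest first.
        nearest = heapq.nsmallest(k, candidates)
        neighbors.append([j for _, j in nearest])
    return neighbors


# Three points on a line: the point at x=5 is closer to x=1 than to x=0.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(brute_force_knn(points, k=1))
```

A spatial index such as an octree avoids most of these comparisons by restricting each query to nearby cells, which is the idea behind the complexity reduction stated in the abstract.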

* Technical report

Structured Co-reference Graph Attention for Video-grounded Dialogue

Mar 24, 2021

Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo

Abstract:A video-grounded dialogue system referred to as the Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question regarding a given video while keeping track of the dialogue context. Although recent efforts have made great strides in improving the quality of the response, performance is still far from satisfactory. The two main challenging issues are as follows: (1) how to deduce co-reference among multiple modalities and (2) how to reason on the rich underlying semantic structure of video with complex spatial and temporal dynamics. To this end, SCGA is based on (1) a Structured Co-reference Resolver that performs dereferencing by building a structured graph over multiple modalities, and (2) a Spatio-temporal Video Reasoner that captures local-to-global dynamics of video via gradually neighboring graph attention. SCGA makes use of a pointer network to dynamically replicate parts of the question for decoding the answer sequence. The validity of the proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, two challenging video-grounded dialogue benchmarks, and the TVQA dataset, a large-scale videoQA benchmark. Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on both benchmarks, while an extensive ablation study and qualitative analysis reveal performance gain and improved interpretability.

* Accepted to AAAI2021

VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval

Aug 24, 2020

Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, Chang D. Yoo

Abstract:Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video specified by a natural language query. For VMR, several methods that require full supervision for training have been proposed. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is a labor-intensive process. This paper explores methods for performing VMR in a weakly-supervised manner (wVMR): training is performed without temporal moment labels but only with the text query that describes a segment of the video. Existing methods on wVMR generate multi-scale proposals and apply query-guided attention mechanisms to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which predicts higher scores for the correct video-query pairs than for the incorrect pairs. It has been observed that a large number of candidate proposals, coarse query representation, and a one-way attention mechanism lead to blurry attention maps, which limit the localization performance. To handle this issue, the Video-Language Alignment Network (VLANet) is proposed, which learns sharper attention by pruning out spurious candidate proposals and applying a multi-directional attention mechanism with fine-grained query representation. The Surrogate Proposal Selection module selects a proposal based on the proximity to the query in the joint embedding space, and thus substantially reduces candidate proposals, which leads to lower computation load and sharper attention. Next, the Cascaded Cross-modal Attention module considers dense feature interactions and multi-directional attention flow to learn the multi-modal alignment. VLANet is trained end-to-end using a contrastive loss that encourages semantically similar videos and queries to cluster together. The experiments show that the method achieves state-of-the-art performance on the Charades-STA and DiDeMo datasets.
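
The idea of selecting a proposal by its proximity to the query in a joint embedding space can be sketched as below. The abstract does not say which proximity measure is used, so cosine similarity is an assumption here, as are the function names and the toy embeddings; this is not the VLANet module itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def select_proposal(query_embedding, proposal_embeddings):
    """Return (index, score) of the proposal closest to the query.

    Assumes query and proposals already live in the same embedding space.
    """
    scores = [cosine_similarity(query_embedding, p) for p in proposal_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical embeddings: the second proposal points nearly the same way as the query.
query = [1.0, 0.0]
proposals = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(select_proposal(query, proposals))
```

Keeping only the best-matching proposal per query, rather than attending over all candidates, is what reduces the candidate set in this simplified picture.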

* 16 pages, 6 figures, European Conference on Computer Vision, 2020

Modality Shifting Attention Network for Multi-modal Video Question Answering

Jul 04, 2020

Junyeong Kim, Minuk Ma, Trung Pham, Kyungsu Kim, Chang D. Yoo

Abstract:This paper considers a network referred to as the Modality Shifting Attention Network (MSAN) for the Multimodal Video Question Answering (MVQA) task. MSAN decomposes the task into two sub-tasks: (1) localization of the temporal moment relevant to the question, and (2) accurate prediction of the answer based on the localized moment. The modality required for temporal localization may be different from that for answer prediction, and this ability to shift modality is essential for performing the task. To this end, MSAN is based on (1) the moment proposal network (MPN) that attempts to locate the most appropriate temporal moment from each of the modalities, and also on (2) the heterogeneous reasoning network (HRN) that predicts the answer using an attention mechanism on both modalities. MSAN is able to place importance weight on the two modalities for each sub-task using a component referred to as Modality Importance Modulation (MIM). Experimental results show that MSAN outperforms the previous state-of-the-art by achieving 71.13% test accuracy on the TVQA benchmark dataset. Extensive ablation studies and qualitative analysis are conducted to validate various components of the network.

* CVPR2020 accepted; poster
