Ho-Jin Choi

LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

Mar 25, 2025

Jong Myoung Kim, Young-Jun Lee, Ho-Jin Choi, Sangkeun Jung

Abstract:While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.

* now preparing

Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model

Nov 07, 2024

Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Ho-Jin Choi

Abstract:To increase social bonding with interlocutors, humans naturally acquire the ability to respond appropriately in a given situation by considering which conversational skill is most suitable for the response - a process we call skill-of-mind. For large language model (LLM)-based conversational agents, planning appropriate conversational skills, as humans do, is challenging due to the complexity of social dialogue, especially in interactive scenarios. To address this, we propose a skill-of-mind-annotated conversation dataset, named Multifaceted Skill-of-Mind, which includes multi-turn and multifaceted conversational skills across various interactive scenarios (e.g., long-term, counseling, task-oriented), grounded in diverse social contexts (e.g., demographics, persona, rules of thumb). This dataset consists of roughly 100K conversations. Using this dataset, we introduce a new family of skill-of-mind-infused LLMs, named Thanos, with model sizes of 1B, 3B, and 8B parameters. With extensive experiments, these models successfully demonstrate the skill-of-mind process and exhibit strong generalizability in inferring multifaceted skills across a variety of domains. Moreover, we show that Thanos significantly enhances the quality of responses generated by LLM-based conversational agents and promotes prosocial behavior in human evaluations.

* Code: https://github.com/passing2961/Thanos

Enhancing Speech Emotion Recognition through Segmental Average Pooling of Self-Supervised Learning Features

Oct 16, 2024

Jonghwan Hyeon, Yung-Hwan Oh, Ho-Jin Choi

Abstract:Speech Emotion Recognition (SER) analyzes human emotions expressed through speech. Self-supervised learning (SSL) offers a promising approach to SER by learning meaningful representations from a large amount of unlabeled audio data. However, existing SSL-based methods rely on Global Average Pooling (GAP) to represent audio signals, treating speech and non-speech segments equally. This can lead to dilution of informative speech features by irrelevant non-speech information. To address this, the paper proposes Segmental Average Pooling (SAP), which selectively focuses on informative speech segments while ignoring non-speech segments. By applying both GAP and SAP to SSL features, our approach utilizes overall speech signal information from GAP and specific information from SAP, leading to improved SER performance. Experiments show state-of-the-art results on IEMOCAP (English) and superior performance on KEMDy19 (Korean) in both unweighted and weighted accuracies.
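
The difference between GAP and SAP described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the per-frame speech mask, the choice to average within each contiguous speech segment before averaging across segments, the fallback for utterances with no speech, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vectors(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def global_average_pooling(frames: Sequence[Sequence[float]]) -> Vector:
    """GAP: average every frame, speech and non-speech alike."""
    return mean_vectors(frames)


def segmental_average_pooling(
    frames: Sequence[Sequence[float]], is_speech: Sequence[bool]
) -> Vector:
    """SAP (assumed form): average inside each contiguous speech segment,
    then average the segment means. Non-speech frames are ignored."""
    if len(frames) != len(is_speech):
        raise ValueError("frames and is_speech must have the same length")
    segment_means: List[Vector] = []
    current: List[Sequence[float]] = []
    for frame, speech in zip(frames, is_speech):
        if speech:
            current.append(frame)
        elif current:
            # A non-speech frame closes the running speech segment.
            segment_means.append(mean_vectors(current))
            current = []
    if current:
        segment_means.append(mean_vectors(current))
    if not segment_means:
        # Assumption: fall back to GAP when no speech frame is present.
        return global_average_pooling(frames)
    return mean_vectors(segment_means)


def pooled_utterance_embedding(
    frames: Sequence[Sequence[float]], is_speech: Sequence[bool]
) -> Vector:
    """Concatenate the GAP and SAP vectors into one utterance-level feature."""
    return global_average_pooling(frames) + segmental_average_pooling(frames, is_speech)


if __name__ == "__main__":
    toy_frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [6.0, 6.0]]
    toy_mask = [False, True, True, False, True]
    print(pooled_utterance_embedding(toy_frames, toy_mask))
```

In this toy input the two silent frames pull the GAP vector toward zero, while the SAP half of the output depends only on the two speech segments.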

Intriguing Properties of Large Language and Vision Models

Oct 07, 2024

Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi

Abstract:Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question along several aspects: permutation invariance, robustness, math reasoning, alignment preservation, and importance, by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
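
The permutation probe mentioned in property (1) can be pictured with a toy sketch: split an image into patches, then shuffle the patch order before it would be passed on to a model. The grid-of-lists image, the row-major patch order, the fixed seed, and both function names are assumptions for illustration only; the abstract does not specify how the authors permute patches.

```python
import random
from typing import List, Sequence


def split_into_patches(image: Sequence[Sequence[int]], patch_size: int) -> List[List[List[int]]]:
    """Cut a 2D grid into non-overlapping square patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [list(row[left:left + patch_size]) for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


def permute_patches(patches: Sequence[List[List[int]]], seed: int = 0) -> List[List[List[int]]]:
    """Return the same patches in a random order (reproducible via the seed)."""
    rng = random.Random(seed)
    shuffled = list(patches)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
    original = split_into_patches(toy_image, patch_size=2)
    shuffled = permute_patches(original, seed=0)
    # Same patch contents, possibly different sequence order.
    print(len(original), len(shuffled))
```

A permutation-invariance check would then compare a model's answers on the original and shuffled sequences; that comparison is left out here because it depends on the model being evaluated.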

* Code is available at https://github.com/passing2961/IP-LLVM

Does Incomplete Syntax Influence Korean Language Model? Focusing on Word Order and Case Markers

Jul 12, 2024

Jong Myoung Kim, Young-Jun Lee, Yong-jin Han, Sangkeun Jung, Ho-Jin Choi

Abstract:Syntactic elements, such as word order and case markers, are fundamental in natural language processing. Recent studies show that syntactic information boosts language model performance and offers clues for people to understand their learning mechanisms. Unlike languages with a fixed word order such as English, Korean allows for varied word sequences, despite its canonical structure, due to case markers that indicate the functions of sentence components. This study explores whether Korean language models can accurately capture this flexibility. We note that incomplete word orders and omitted case markers frequently appear in ordinary Korean communication. To investigate this further, we introduce the Syntactically Incomplete Korean (SIKO) dataset. Through SIKO, we assessed Korean language models' flexibility with incomplete syntax and confirmed the dataset's training value. Results indicate these models reflect Korean's inherent flexibility, accurately handling incomplete inputs. Moreover, fine-tuning with SIKO enhances the ability to handle common incomplete Korean syntactic forms. The dataset's simple construction process, coupled with significant performance enhancements, solidifies its standing as an effective data augmentation technique.

* COLM 2024; code and dataset are available at https://github.com/grayapple-git/SIKO

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Jul 04, 2024

Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Byungsoo Ko, Jonghwan Hyeon, Ho-Jin Choi

Abstract:Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works focus on (1) image-sharing behavior in singular sessions, leading to limited long-term social interaction, and (2) a lack of personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas in a multi-modality format, time intervals, and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, that generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using our Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset in human evaluation. We make our source code and dataset publicly available.

* Project website: https://stark-dataset.github.io

Large Language Models can Share Images, Too!

Oct 23, 2023

Young-Jun Lee, Jonghwan Hyeon, Ho-Jin Choi

Abstract:This paper explores the image-sharing capability of Large Language Models (LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting, without the help of visual foundation models. Inspired by the two-stage process of image-sharing in human dialogues, we propose a two-stage framework that allows LLMs to predict potential image-sharing turns and generate related image descriptions using our effective restriction-based prompt template. With extensive experiments, we unlock the image-sharing capability of LLMs in zero-shot prompting, with GPT-4 achieving the best performance. Additionally, we uncover the emergent image-sharing ability in zero-shot prompting, demonstrating the effectiveness of restriction-based prompts in both stages of our framework. Based on this framework, we augment the PhotoChat dataset with images generated by Stable Diffusion at predicted turns, namely PhotoChat++. To our knowledge, this is the first study to assess the image-sharing ability of LLMs in a zero-shot setting without visual foundation models. The source code and the dataset will be released after publication.

Towards Interpretable Controllability in Object-Centric Learning

Oct 16, 2023

Jinwoo Kim, Janghyuk Choi, Jaehyun Kang, Changyeon Lee, Ho-Jin Choi, Seon Joo Kim

Abstract:The binding problem in artificial neural networks is actively explored with the goal of achieving human-level recognition skills through the comprehension of the world in terms of symbol-like entities. Especially in the field of computer vision, object-centric learning (OCL) is extensively researched to better understand complex scenes by acquiring object representations or slots. While recent studies in OCL have made strides with complex images or videos, the interpretability and interactivity over object representation remain largely uncharted, still holding promise in the field of OCL. In this paper, we introduce a novel method, Slot Attention with Image Augmentation (SlotAug), to explore the possibility of learning interpretable controllability over slots in a self-supervised manner by utilizing an image augmentation strategy. We also devise the concept of sustainability in controllable slots by introducing iterative and reversible controls over slots with two proposed submethods: Auxiliary Identity Manipulation and Slot Consistency Loss. Extensive empirical studies and theoretical validation confirm the effectiveness of our approach, offering a novel capability for interpretable and sustainable control of object representations. Code will be available soon.

Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning

Mar 31, 2023

Jinwoo Kim, Janghyuk Choi, Ho-Jin Choi, Seon Joo Kim

Abstract:Object-centric learning (OCL) aspires to a general and compositional understanding of scenes by representing a scene as a collection of object-centric representations. OCL has also been extended to multi-view image and video datasets to apply various data-driven inductive biases by utilizing geometric or temporal information in the multi-image data. Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do. Hence, owing to the difficulty of applying inductive biases, OCL for single-view images remains challenging, resulting in inconsistent learning of object-centric representation. To this end, we introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention. The new modules, Attention Refining Kernel (ARK) and Intermediate Point Predictor and Encoder (IPPE), respectively, prevent slots from being distracted by the background noise and indicate locations for slots to focus on to facilitate learning of object-centric representation. We also propose a weak semi-supervision approach for OCL, whilst our proposed framework can be used without any assistant annotation during inference. Experiments show that our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets. Code is available at https://github.com/object-understanding/SLASH.

DialogCC: Large-Scale Multi-Modal Dialogue Dataset

Dec 08, 2022

Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Ho-Jin Choi

Abstract:As sharing images in an instant message is a crucial factor, there has been active research on learning an image-text multi-modal dialogue model. However, training a well-generalized multi-modal dialogue model is challenging because existing multi-modal dialogue datasets contain a small amount of data, limited topics, and a restricted variety of images per dialogue. In this paper, we present a multi-modal dialogue dataset creation pipeline that involves matching large-scale images to dialogues based on CLIP similarity. Using this automatic pipeline, we propose a large-scale multi-modal dialogue dataset, DialogCC, which covers diverse real-world topics and various images per dialogue. With extensive experiments, we demonstrate that training a multi-modal dialogue model with our dataset can improve generalization performance. Additionally, existing models trained with our dataset achieve state-of-the-art performance on image and text retrieval tasks. The source code and the dataset will be released after publication.
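
The similarity-based matching step in the abstract above can be sketched roughly as follows. No CLIP model is loaded here: the embeddings are hand-written toy vectors standing in for CLIP text and image features, and the threshold, the `top_k` cutoff, and the function names are hypothetical choices, not details taken from the DialogCC pipeline.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_images(
    utterance_embedding: Sequence[float],
    image_embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff, not from the paper
    top_k: int = 2,          # assumed cap on images per utterance
) -> List[Tuple[str, float]]:
    """Return up to top_k (image_id, score) pairs scoring at or above threshold."""
    scored = [
        (image_id, cosine_similarity(utterance_embedding, embedding))
        for image_id, embedding in image_embeddings.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    # Toy stand-ins for text and image embeddings in a shared space.
    utterance = [1.0, 0.0]
    images = {"dog.jpg": [1.0, 0.0], "cat.jpg": [0.6, 0.8], "car.jpg": [0.0, 1.0]}
    for image_id, score in match_images(utterance, images):
        print(image_id, round(score, 3))
```

Thresholding before ranking keeps weakly related images out of a dialogue even when few candidates exist, which is one plausible way to trade dataset size against match quality.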
