Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiyeon Kim

Let's Predict Sentence by Sentence

May 28, 2025

Hyeonbin Hwang, Byeongguk Jeon, Seungone Kim, Jiyeon Kim, Hoyeon Chang, Sohee Yang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo

Abstract:Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions - sentences, propositions, and concepts. This contrast raises a central question- Can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains - mathematics, logic, commonsense, and planning - contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.

* Work In Progress

Via

Access Paper or Ask Questions

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

Oct 10, 2024

Seongyun Lee, Geewook Kim, Jiyeon Kim, Hyunji Lee, Hoyeon Chang, Sue Hyun Park, Minjoon Seo

Figure 1 for How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

Figure 2 for How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

Figure 3 for How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

Figure 4 for How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

Abstract:Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs) into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often compromises the inherent safety capabilities embedded in the original LLMs. Despite potential harmfulness due to weakened safety measures, in-depth analysis on the effects of VL adaptation on safety remains under-explored. This study examines how VL adaptation influences safety and evaluates the impact of safety fine-tuning methods. Our analysis reveals that safety degradation occurs during VL adaptation, even when the training data is safe. While safety tuning techniques like supervised fine-tuning with safety datasets or reinforcement learning from human feedback mitigate some risks, they still lead to safety degradation and a reduction in helpfulness due to over-rejection issues. Further analysis of internal model weights suggests that VL adaptation may impact certain safety-related layers, potentially lowering overall safety levels. Additionally, our findings demonstrate that the objectives of VL adaptation and safety tuning are divergent, which often results in their simultaneous application being suboptimal. To address this, we suggest the weight merging approach as an optimal solution effectively reducing safety degradation while maintaining helpfulness. These insights help guide the development of more reliable and secure LVLMs for real-world applications.

Via

Access Paper or Ask Questions

Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Oct 02, 2024

Jiyeon Kim, Hyunji Lee, Hyowon Cho, Joel Jang, Hyeonbin Hwang, Seungpil Won, Youbin Ahn, Dohaeng Lee, Minjoon Seo

Figure 1 for Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Figure 2 for Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Figure 3 for Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Figure 4 for Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Abstract:In this work, we investigate how a model's tendency to broadly integrate its parametric knowledge evolves throughout pretraining, and how this behavior affects overall performance, particularly in terms of knowledge acquisition and forgetting. We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that the decline is closely associated with a reduction in the model's ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (smaller number of active memory sources) impairs the model's knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model's capacity for knowledge acquisition and retention.

Via

Access Paper or Ask Questions

PyGRF: An improved Python Geographical Random Forest model and case studies in public health and natural disasters

Sep 20, 2024

Kai Sun, Ryan Zhenqi Zhou, Jiyeon Kim, Yingjie Hu

Abstract:Geographical random forest (GRF) is a recently developed and spatially explicit machine learning model. With the ability to provide more accurate predictions and local interpretations, GRF has already been used in many studies. The current GRF model, however, has limitations in its determination of the local model weight and bandwidth hyperparameters, potentially insufficient numbers of local training samples, and sometimes high local prediction errors. Also, implemented as an R package, GRF currently does not have a Python version which limits its adoption among machine learning practitioners who prefer Python. This work addresses these limitations by introducing theory-informed hyperparameter determination, local training sample expansion, and spatially-weighted local prediction. We also develop a Python-based GRF model and package, PyGRF, to facilitate the use of the model. We evaluate the performance of PyGRF on an example dataset and further demonstrate its use in two case studies in public health and natural disasters.

* Transactions in GIS, 2024

Via

Access Paper or Ask Questions

ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval

Feb 28, 2024

Soyoung Yoon, Eunbi Choi, Jiyeon Kim, Yireun Kim, Hyeongu Yun, Seung-won Hwang

Abstract:We propose ListT5, a novel reranking approach based on Fusion-in-Decoder (FiD) that handles multiple candidate passages at both train and inference time. We also introduce an efficient inference framework for listwise ranking based on m-ary tournament sort with output caching. We evaluate and compare our model on the BEIR benchmark for zero-shot retrieval task, demonstrating that ListT5 (1) outperforms the state-of-the-art RankT5 baseline with a notable +1.3 gain in the average NDCG@10 score, (2) has an efficiency comparable to pointwise ranking models and surpasses the efficiency of previous listwise ranking models, and (3) overcomes the lost-in-the-middle problem of previous listwise rerankers. Our code, model checkpoints, and the evaluation framework are fully open-sourced at \url{https://github.com/soyoung97/ListT5}.

Via

Access Paper or Ask Questions

Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech

Jan 19, 2024

Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda

Abstract:Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually, ARPABET or IPA, which might not be the most optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as good or even marginally better than the conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise.

* Accepted at ICASSP 2024

Via

Access Paper or Ask Questions

A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography

Nov 14, 2023

Jaeik Jeon, Jiyeon Kim, Yeonggul Jang, Yeonyee E. Yoon, Dawun Jeong, Youngtaek Hong, Seung-Ah Lee, Hyuk-Jae Chang

Figure 1 for A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography

Figure 2 for A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography

Figure 3 for A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography

Figure 4 for A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography

Abstract:Doppler echocardiography offers critical insights into cardiac function and phases by quantifying blood flow velocities and evaluating myocardial motion. However, previous methods for automating Doppler analysis, ranging from initial signal processing techniques to advanced deep learning approaches, have been constrained by their reliance on electrocardiogram (ECG) data and their inability to process Doppler views collectively. We introduce a novel unified framework using a convolutional neural network for comprehensive analysis of spectral and tissue Doppler echocardiography images that combines automatic measurements and end-diastole (ED) detection into a singular method. The network automatically recognizes key features across various Doppler views, with novel Doppler shape embedding and anti-aliasing modules enhancing interpretation and ensuring consistent analysis. Empirical results indicate a consistent outperformance in performance metrics, including dice similarity coefficients (DSC) and intersection over union (IoU). The proposed framework demonstrates strong agreement with clinicians in Doppler automatic measurements and competitive performance in ED detection.

Via

Access Paper or Ask Questions

Echocardiographic View Classification with Integrated Out-of-Distribution Detection for Enhanced Automatic Echocardiographic Analysis

Aug 31, 2023

Jaeik Jeon, Seongmin Ha, Yeonyee E. Yoon, Jiyeon Kim, Hyunseok Jeong, Dawun Jeong, Yeonggul Jang, Youngtaek Hong, Hyuk-Jae Chang

Abstract:In the rapidly evolving field of automatic echocardiographic analysis and interpretation, automatic view classification is a critical yet challenging task, owing to the inherent complexity and variability of echocardiographic data. This study presents ECHOcardiography VIew Classification with Out-of-Distribution dEtection (ECHO-VICODE), a novel deep learning-based framework that effectively addresses this challenge by training to classify 31 classes, surpassing previous studies and demonstrating its capacity to handle a wide range of echocardiographic views. Furthermore, ECHO-VICODE incorporates an integrated out-of-distribution (OOD) detection function, leveraging the relative Mahalanobis distance to effectively identify 'near-OOD' instances commonly encountered in echocardiographic data. Through extensive experimentation, we demonstrated the outstanding performance of ECHO-VICODE in terms of view classification and OOD detection, significantly reducing the potential for errors in echocardiographic analyses. This pioneering study significantly advances the domain of automated echocardiography analysis and exhibits promising prospects for substantial applications in extensive clinical research and practice.

Via

Access Paper or Ask Questions

Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

Nov 19, 2021

Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim

Figure 1 for Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

Figure 2 for Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

Figure 3 for Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

Figure 4 for Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

Abstract:In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques such as transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we leverage a well-trained English model, unlabeled text corpus, and unlabeled audio corpus using transfer learning, TTS augmentation, and SSL respectively. In the first stage, we use transfer learning from a well-trained English model. This primarily helps in learning the acoustic information from a resource-rich language. This stage achieves around 24% relative Word Error Rate (WER) reduction over the baseline. In stage two, We utilize unlabeled text data via TTS data-augmentation to incorporate language information into the model. We also explore freezing the acoustic encoder at this stage. TTS data augmentation helps us further reduce the WER by ~ 21% relatively. Finally, In stage three we reduce the WER by another 4% relative by using SSL from unlabeled audio data. Overall, our two-pass speech recognition system with a Monotonic Chunkwise Attention (MoChA) in the first pass and a full-attention in the second pass achieves a WER reduction of ~ 42% relative to the baseline.

* Accepted as a conference paper at ASRU 2021

Via

Access Paper or Ask Questions

A comparison of streaming models and data augmentation methods for robust speech recognition

Nov 19, 2021

Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim

Figure 1 for A comparison of streaming models and data augmentation methods for robust speech recognition

Figure 2 for A comparison of streaming models and data augmentation methods for robust speech recognition

Figure 3 for A comparison of streaming models and data augmentation methods for robust speech recognition

Figure 4 for A comparison of streaming models and data augmentation methods for robust speech recognition

Abstract:In this paper, we present a comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques, namely, multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variability, and SpecAugment. Experimental results show that unidirectional models are in general more sensitive to noisy examples in the training set. It is observed that the final performance of the model depends on the proportion of training examples processed by data augmentation techniques. MoChA models generally perform better than RNN-T models. However, we observe that training of MoChA models seems to be more sensitive to various factors such as the characteristics of training sets and the incorporation of additional augmentations techniques. On the other hand, RNN-T models perform better than MoChA models in terms of latency, inference time, and the stability of training. Additionally, RNN-T models are generally more robust against noise and reverberation. All these advantages make RNN-T models a better choice for streaming on-device speech recognition compared to MoChA models.

* Accepted as a conference paper at ASRU 2021

Via

Access Paper or Ask Questions