Sungjin Park

Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Dec 20, 2024

Sungjin Park, Xiao Liu, Yeyun Gong, Edward Choi

Abstract:Despite recent advances in large language models, open-source models often struggle to consistently perform well on complex reasoning tasks. Existing ensemble methods, whether applied at the token or output levels, fail to address these challenges. In response, we present Language model Ensemble with Monte Carlo Tree Search (LE-MCTS), a novel framework for process-level ensembling of language models. LE-MCTS formulates step-by-step reasoning with an ensemble of language models as a Markov decision process. In this framework, states represent intermediate reasoning paths, while actions consist of generating the next reasoning step using one of the language models selected from a predefined pool. Guided by a process-based reward model, LE-MCTS performs a tree search over the reasoning steps generated by different language models, identifying the most accurate reasoning chain. Experimental results on five mathematical reasoning benchmarks demonstrate that our approach outperforms both single language model decoding algorithms and language model ensemble methods. Notably, LE-MCTS improves performance by 3.6% and 4.3% on the MATH and MQA datasets, respectively, highlighting its effectiveness in solving complex reasoning problems.
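The search described in this abstract can be illustrated with a minimal sketch. It uses a generic UCT-style Monte Carlo tree search, not the paper's exact selection or backpropagation rules, which the abstract does not specify. Every name below (`Node`, `search`, `good_model`, `bad_model`, `toy_reward`) is hypothetical: the two toy "models" stand in for real language models, and `toy_reward` stands in for a process reward model.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: a "model" maps a partial reasoning path to one next step,
# and a "process reward model" scores a partial path in [0, 1].
StepModel = Callable[[List[str]], str]
RewardModel = Callable[[List[str]], float]


@dataclass
class Node:
    path: List[str]                       # state: intermediate reasoning path
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                    # sum of rewards backed up through this node

    def mean(self) -> float:
        # Unvisited nodes rank below any visited node when extracting the answer.
        return self.value / self.visits if self.visits else -1.0


def uct(child: Node, c: float = 1.0) -> float:
    """Standard UCT score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")
    explore = math.sqrt(math.log(child.parent.visits) / child.visits)
    return child.mean() + c * explore


def search(models: List[StepModel], reward: RewardModel,
           max_depth: int = 3, iterations: int = 20) -> List[str]:
    root = Node(path=[])
    for _ in range(iterations):
        # Selection: follow the highest UCT score down to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: one action per model in the pool (each proposes a next step).
        if len(node.path) < max_depth:
            node.children = [Node(node.path + [m(node.path)], parent=node)
                             for m in models]
            node = node.children[0]
        # Evaluation: score the partial path with the reward model.
        r = reward(node.path)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Read out the chain with the highest mean reward at each level.
    node = root
    while node.children:
        node = max(node.children, key=Node.mean)
    return node.path


# Toy stand-ins (hypothetical): one model always emits a valid step, one never does.
def good_model(path: List[str]) -> str:
    return "ok"


def bad_model(path: List[str]) -> str:
    return "bad"


def toy_reward(path: List[str]) -> float:
    return 1.0 if all(step == "ok" for step in path) else 0.0


print(search([good_model, bad_model], toy_reward))  # ['ok', 'ok', 'ok']
```

In this toy setting the reward model steers the search toward steps from the stronger model at every depth, which is the intuition behind process-level ensembling; a real system would replace the stand-ins with sampled LLM generations and a learned reward model.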

Multimodal Transformer With a Low-Computational-Cost Guarantee

Feb 23, 2024

Sungjin Park, Edward Choi

Abstract:Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers suffer from the quadratic complexity of multi-head attention in the input sequence length, especially as the number of modalities increases. To address this, we introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning different multimodal attention patterns to each attention head, LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants. Experimental results on two multimodal datasets, namely Audioset and MedVidCL, demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.
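To make the cost argument concrete, the following sketch counts query-key pairs per attention head under three patterns. The patterns themselves (full, unimodal-only, cross-modal-only) and all function names are illustrative assumptions; the abstract does not say which patterns LoCoMT assigns to which heads.

```python
from typing import Sequence


def full_pairs(lengths: Sequence[int]) -> int:
    """Every token attends to every token of every modality."""
    total = sum(lengths)
    return total * total


def unimodal_pairs(lengths: Sequence[int]) -> int:
    """Each token attends only to tokens of its own modality."""
    return sum(n * n for n in lengths)


def cross_modal_pairs(lengths: Sequence[int]) -> int:
    """Each token attends only to tokens of the other modalities."""
    return full_pairs(lengths) - unimodal_pairs(lengths)


def layer_pairs(lengths: Sequence[int], n_full: int,
                n_unimodal: int, n_cross: int) -> int:
    """Total query-key pairs in one layer for a given (assumed) head assignment."""
    return (n_full * full_pairs(lengths)
            + n_unimodal * unimodal_pairs(lengths)
            + n_cross * cross_modal_pairs(lengths))


lengths = (4, 6)                              # e.g. 4 audio tokens, 6 video tokens
baseline = layer_pairs(lengths, 8, 0, 0)      # 8 heads, all full attention
mixed = layer_pairs(lengths, 0, 4, 4)         # 4 unimodal heads + 4 cross-modal heads
print(baseline, mixed)                        # 800 400
```

Because the unimodal and cross-modal patterns partition the full attention map, any head that uses only one of them computes strictly fewer pairs than a full-attention head whenever both modalities are non-empty.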

* Accepted to ICASSP 2024 (5 pages)

FactKG: Fact Verification via Reasoning on Knowledge Graphs

May 19, 2023

Jiho Kim, Sungjin Park, Yeonsu Kwon, Yohan Jo, James Thorne, Edward Choi

Abstract:In real-world applications, knowledge graphs (KG) are widely used in various domains (e.g., medical applications and dialogue agents). However, for fact verification, KGs have not been adequately utilized as a knowledge source. KGs can be a valuable knowledge source in fact verification due to their reliability and broad applicability. A KG consists of nodes and edges, which makes it clear how concepts are linked together, allowing machines to reason over chains of topics. However, there are many challenges in understanding how these machine-readable concepts map to information in text. To enable the community to better use KGs, we introduce a new dataset, FactKG: Fact Verification via Reasoning on Knowledge Graphs. It consists of 108k natural language claims with five types of reasoning: One-hop, Conjunction, Existence, Multi-hop, and Negation. Furthermore, FactKG contains various linguistic patterns, including colloquial-style claims as well as written-style claims, to increase practicality. Lastly, we develop a baseline approach and analyze FactKG over these reasoning types. We believe FactKG can advance both reliability and practicality in KG-based fact verification.

* Accepted to ACL 2023

Do Language Models Understand Measurements?

Oct 23, 2022

Sungjin Park, Seungwoo Ryu, Edward Choi

Abstract:The recent success of pre-trained language models (PLMs) has stimulated interest in their ability to understand and work with numbers. Yet numerical reasoning over measurements has not been formally studied despite its importance. In this study, we show that PLMs lack the capability required for reasoning over measurements. Furthermore, we find that a language model trained on a measurement-rich corpus shows better performance on understanding measurements. We propose a simple embedding strategy to better distinguish between numbers and units, which leads to a significant improvement in the probing tasks.

* Findings of EMNLP 2022

Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Apr 15, 2022

Hyungyung Lee, Sungjin Park, Edward Choi

Abstract:Though deep generative models have gained a lot of attention, most existing works are designed for unimodal generation tasks. In this paper, we explore a new method for unconditional image-text pair generation. We propose MXQ-VAE, a vector quantization method for multimodal image-text representation. MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space, so that the image-text pair can be converted to a sequence of unified indices. Then we can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation. Extensive experimental results demonstrate that our approach effectively generates semantically consistent image-text pairs and also enhances meaningful alignment between image and text.
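The step of converting continuous features into a sequence of indices can be sketched with plain nearest-neighbor vector quantization, as used in VQ-VAE-style models. This is a generic illustration, not MXQ-VAE's joint image-text quantizer; the function names, the toy codebook, and the feature values are all assumptions made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


# Toy codebook with three 2-D entries, and three encoder outputs to quantize.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
features = [(0.1, -0.2), (0.9, 1.2), (3.5, 4.1)]
print(quantize(features, codebook))  # [0, 1, 2]
```

Once inputs are expressed as index sequences like this, an autoregressive model can be trained over the indices, which is the role the abstract assigns to the generative stage.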

* ICLR 2022 workshop on Deep Generative Models for Highly Structured Data

Graph-Text Multi-Modal Pre-training for Medical Representation Learning

Mar 18, 2022

Sungjin Park, Seongsu Bae, Jiho Kim, Tackeun Kim, Edward Choi

Abstract:As the volume of Electronic Health Records (EHR) grows sharply, there has been emerging interest in learning the representation of EHR for healthcare applications. Representation learning of EHR requires appropriate modeling of the two dominant modalities in EHR: structured data and unstructured text. In this paper, we present MedGTX, a pre-trained model for multi-modal representation learning of the structured and textual EHR data. MedGTX uses a novel graph encoder to exploit the graphical nature of structured EHR data, a text encoder to handle unstructured text, and a cross-modal encoder to learn a joint representation space. We pre-train our model through four proxy tasks on MIMIC-III, an open-source EHR dataset, and evaluate our model on two clinical benchmarks and three novel downstream tasks which tackle real-world problems in EHR data. The results consistently show the effectiveness of pre-training the model for joint representation of both structured and unstructured information from EHR. Given the promising performance of MedGTX, we believe this work opens a new door to jointly understanding the two fundamental modalities of EHR data.

* To appear in Proceedings of the Conference on Health, Inference, and Learning (CHIL 2022)

FreeTalky: Don't Be Afraid! Conversations Made Easier by a Humanoid Robot using Persona-based Dialogue

Dec 08, 2021

Chanjun Park, Yoonna Jang, Seolhwa Lee, Sungjin Park, Heuiseok Lim

Abstract:We propose a deep learning-based foreign language learning platform, named FreeTalky, for people who experience anxiety dealing with foreign languages, by employing the humanoid robot NAO and various deep learning models. A persona-based dialogue system that is embedded in NAO provides an interesting and consistent multi-turn dialogue for users. Also, a grammar error correction system promotes improvement in the grammar skills of the users. Thus, our system enables personalized learning based on persona dialogue and facilitates grammar learning of a user using grammar error feedback. Furthermore, we verified through human evaluation whether FreeTalky provides practical help in alleviating xenoglossophobia by replacing the real human in the conversation with a NAO robot.

* Accepted for Artificial Intelligence for Education (AI4EDU) workshop at AAAI 2022

Real-time Denoising and Dereverberation with Tiny Recurrent U-Net

Feb 10, 2021

Hyeong-Seok Choi, Sungjin Park, Jie Hwan Lee, Hoon Heo, Dongsuk Jeon, Kyogu Lee

Abstract:Modern deep learning-based models have seen outstanding performance improvement with speech enhancement tasks. The number of parameters of state-of-the-art models, however, is often too large to be deployed on devices for real-world applications. To this end, we propose Tiny Recurrent U-Net (TRU-Net), a lightweight online inference model that matches the performance of current state-of-the-art models. The size of the quantized version of TRU-Net is 362 kilobytes, which is small enough to be deployed on edge devices. In addition, we combine the small-sized model with a new masking method called phase-aware $\beta$-sigmoid mask, which enables simultaneous denoising and dereverberation. Results of both objective and subjective evaluations have shown that our model can achieve competitive performance with the current state-of-the-art models on benchmark datasets using fewer parameters by orders of magnitude.

* 5 pages, 2 figures, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv admin note: text overlap with arXiv:2006.00687

Multi-View Attention Networks for Visual Dialog

Apr 29, 2020

Sungjin Park, Taesun Whang, Yeochan Yoon, Heuiseok Lim

Abstract:Visual dialog is a challenging vision-language task in which a series of questions visually grounded by a given image are answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question, dialog history, image, and answer) is required. Specifically, it is necessary for an agent to 1) understand question-relevant dialog history and 2) focus on question-relevant visual contents among the diverse visual contents in a given image. In this paper, we propose Multi-View Attention Network (MVAN), which considers complementary views of multimodal inputs based on attention mechanisms. MVAN effectively captures the question-relevant information from the dialog history with two different textual views (i.e., Topic Aggregation and Context Matching), and integrates multimodal representations with a two-step fusion process. Experimental results on VisDial v1.0 and v0.9 benchmarks show the effectiveness of our proposed model, which outperforms the previous state-of-the-art methods with respect to all evaluation metrics.

Semi-supervised Disentanglement with Independent Vector Variational Autoencoders

Mar 14, 2020

Bo-Kyeong Kim, Sungjin Park, Geonmin Kim, Soo-Young Lee

Abstract:We aim to separate the generative factors of data into two latent vectors in a variational autoencoder. One vector captures class factors relevant to target classification tasks, while the other vector captures style factors relevant to the remaining information. To learn the discrete class features, we introduce supervision using a small amount of labeled data, which can simply yet effectively reduce the effort required for hyperparameter tuning performed in existing unsupervised methods. Furthermore, we introduce a learning objective to encourage statistical independence between the vectors. We show that (i) this vector independence term exists within the result obtained on decomposing the evidence lower bound with multiple latent vectors, and (ii) encouraging such independence along with reducing the total correlation within the vectors enhances disentanglement performance. Experiments conducted on several image datasets demonstrate that the disentanglement achieved via our method can improve classification performance and generation controllability.

* 24 pages: 10 p for main paper (8 figures) and 14 p for supplementary material (12 figures). A shortened version of this paper is currently under review by a conference
