Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gyuwan Kim

AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

May 24, 2025

Soyoung Yoon, Gyuwan Kim, Gyu-Hwung Cho, Seung-won Hwang

Figure 1 for AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

Figure 2 for AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

Figure 3 for AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

Figure 4 for AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

Abstract:Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.

* 22 pages, 3 figures. The first two authors contributed equally. Author order is randomly determined via coin toss

Via

Access Paper or Ask Questions

ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

May 22, 2025

Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park

Abstract:Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce \texttt{ScholarBench}, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. \texttt{ScholarBench} targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, \texttt{ScholarBench} evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.

Via

Access Paper or Ask Questions

MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

May 10, 2025

Woosang Lim, Zekun Li, Gyuwan Kim, Sungyoung Ji, HyeonJung Kim, Kyuri Choi, Jin Hyuk Lim, Kyungpyo Park, William Yang Wang

Figure 1 for MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Figure 2 for MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Figure 3 for MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Figure 4 for MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Abstract:Long-context (LC) Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained context windows, and fragmented information caused by suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical retrieval framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through chunk- and document-level expansions in real time. By starting from the finest-level retrieval and progressively incorporating higher-level and broader context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on the challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm that MacRAG consistently surpasses baseline RAG pipelines on single- and multi-step generation with Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.

Via

Access Paper or Ask Questions

Detecting Training Data of Large Language Models via Expectation Maximization

Oct 10, 2024

Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, Miguel Ballesteros, William Yang Wang

Figure 1 for Detecting Training Data of Large Language Models via Expectation Maximization

Figure 2 for Detecting Training Data of Large Language Models via Expectation Maximization

Figure 3 for Detecting Training Data of Large Language Models via Expectation Maximization

Figure 4 for Detecting Training Data of Large Language Models via Expectation Maximization

Abstract:The widespread deployment of large language models (LLMs) has led to impressive advancements, yet information about their training data, a critical factor in their performance, remains undisclosed. Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model's training data. MIAs can offer insights into LLM outputs and help detect and address concerns such as data contamination and compliance with privacy and copyright standards. However, applying MIAs to LLMs presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership. Additionally, creating appropriate benchmarks to evaluate MIA methods is not straightforward, as training and test data distributions are often unknown. In this paper, we introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm, leveraging the duality that the estimates of these scores can be improved by each other. Membership scores and prefix scores assess how each instance is likely to be a member and discriminative as a prefix, respectively. Our method achieves state-of-the-art results on the WikiMIA dataset. To further evaluate EM-MIA, we present OLMoMIA, a benchmark built from OLMo resources, which allows us to control the difficulty of MIA tasks with varying degrees of overlap between training and test data distributions. We believe that EM-MIA serves as a robust MIA method for LLMs and that OLMoMIA provides a valuable resource for comprehensively evaluating MIA approaches, thereby driving future research in this critical area.

* 14 pages

Via

Access Paper or Ask Questions

Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Oct 27, 2022

Soyoung Yoon, Sungjoon Park, Gyuwan Kim, Junhee Cho, Kihyo Park, Gyu Tae Kim, Minjoon Seo, Alice Oh

Figure 1 for Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Figure 2 for Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Figure 3 for Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Figure 4 for Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Abstract:Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English and Chinese. We attribute this problematic circumstance to the lack of a carefully designed evaluation benchmark for Korean. Thus, in this work, we first collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) to cover a wide range of error types and annotate them using our newly proposed tool called Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is a carefully designed edit alignment & classification tool that considers the nature of Korean on generating an alignment between a source sentence and a target sentence, and identifies error types on each aligned edit. We also present baseline models fine-tuned over our datasets. We show that the model trained with our datasets significantly outperforms the public statistical GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets.

* Add affiliation and email address

Via

Access Paper or Ask Questions

Bridging the Training-Inference Gap for Dense Phrase Retrieval

Oct 25, 2022

Gyuwan Kim, Jinhyuk Lee, Barlas Oguz, Wenhan Xiong, Yizhe Zhang, Yashar Mehdad, William Yang Wang

Abstract:Building dense retrievers requires a series of standard procedures, including training and validating neural models and creating indexes for efficient search. However, these procedures are often misaligned in that training objectives do not exactly reflect the retrieval scenario at inference time. In this paper, we explore how the gap between training and inference in dense retrieval can be reduced, focusing on dense phrase retrieval (Lee et al., 2021) where billions of representations are indexed at inference. Since validating every dense retriever with a large-scale index is practically infeasible, we propose an efficient way of validating dense retrievers using a small subset of the entire corpus. This allows us to validate various training strategies including unifying contrastive loss terms and using hard negatives for phrase retrieval, which largely reduces the training-inference discrepancy. As a result, we improve top-1 phrase retrieval accuracy by 2~3 points and top-20 passage retrieval accuracy by 2~4 points for open-domain question answering. Our work urges modeling dense retrievers with careful consideration of training and inference via efficient validation while advancing phrase retrieval as a general solution for dense retrieval.

* Findings of EMNLP 2022; 12 pages, 3 figures

Via

Access Paper or Ask Questions

Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length

Nov 18, 2021

Shira Guskin, Moshe Wasserblat, Ke Ding, Gyuwan Kim

Figure 1 for Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length

Figure 2 for Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length

Figure 3 for Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length

Figure 4 for Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length

Abstract:Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. TinyBERT addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, TinyBERT's performance drops when we reduce the number of layers by 50%, and drops even more abruptly when we reduce the number of layers by 75% for advanced NLP tasks such as span question answering. Additionally, a separate model must be trained for each inference scenario with its distinct computational budget. In this work we present Dynamic-TinyBERT, a TinyBERT model that utilizes sequence-length reduction and Hyperparameter Optimization for enhanced inference efficiency per any computational budget. Dynamic-TinyBERT is trained only once, performing on-par with BERT and achieving an accuracy-speedup trade-off superior to any other efficient approaches (up to 3.3x with <1% loss-drop). Upon publication, the code to reproduce our work will be open-sourced.

* ENLSP NeurIPS Workshop 2021, 7 pages

Via

Access Paper or Ask Questions

SSMix: Saliency-Based Span Mixup for Text Classification

Jun 15, 2021

Soyoung Yoon, Gyuwan Kim, Kyumin Park

Figure 1 for SSMix: Saliency-Based Span Mixup for Text Classification

Figure 2 for SSMix: Saliency-Based Span Mixup for Text Classification

Figure 3 for SSMix: Saliency-Based Span Mixup for Text Classification

Figure 4 for SSMix: Saliency-Based Span Mixup for Text Classification

Abstract:Data augmentation with mixup has shown to be effective on various computer vision tasks. Despite its great success, there has been a hurdle to apply mixup to NLP tasks since text consists of discrete tokens with variable length. In this work, we propose SSMix, a novel mixup method where the operation is performed on input text rather than on hidden vectors like previous approaches. SSMix synthesizes a sentence while preserving the locality of two original texts by span-based mixing and keeping more tokens related to the prediction relying on saliency information. With extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on a wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification. Our code is available at https://github.com/clovaai/ssmix.

* Findings of ACL 2021

Via

Access Paper or Ask Questions

Consistency Training with Virtual Adversarial Discrete Perturbation

Apr 15, 2021

Jungsoo Park, Gyuwan Kim, Jaewoo Kang

Figure 1 for Consistency Training with Virtual Adversarial Discrete Perturbation

Figure 2 for Consistency Training with Virtual Adversarial Discrete Perturbation

Figure 3 for Consistency Training with Virtual Adversarial Discrete Perturbation

Figure 4 for Consistency Training with Virtual Adversarial Discrete Perturbation

Abstract:We propose an effective consistency training framework that enforces a training model's predictions given original and perturbed inputs to be similar by adding a discrete noise that would incur the highest divergence between predictions. This virtual adversarial discrete noise obtained by replacing a small portion of tokens while keeping original semantics as much as possible efficiently pushes a training model's decision boundary. Moreover, we perform an iterative refinement process to alleviate the degraded fluency of the perturbed sentence due to the conditional independence assumption. Experimental results show that our proposed method outperforms other consistency training baselines with text editing, paraphrasing, or a continuous noise on semi-supervised text classification tasks and a robustness benchmark.

* 12 pages

Via

Access Paper or Ask Questions

Two-stage Textual Knowledge Distillation to Speech Encoder for Spoken Language Understanding

Oct 25, 2020

Seongbin Kim, Gyuwan Kim, Seongjin Shin, Sangmin Lee

Figure 1 for Two-stage Textual Knowledge Distillation to Speech Encoder for Spoken Language Understanding

Figure 2 for Two-stage Textual Knowledge Distillation to Speech Encoder for Spoken Language Understanding

Figure 3 for Two-stage Textual Knowledge Distillation to Speech Encoder for Spoken Language Understanding

Figure 4 for Two-stage Textual Knowledge Distillation to Speech Encoder for Spoken Language Understanding

Abstract:End-to-end approaches open a new way for more accurate and efficient spoken language understanding (SLU) systems by alleviating the drawbacks of traditional pipeline systems. Previous works exploit textual information for an SLU model via pre-training with automatic speech recognition or fine-tuning with knowledge distillation. To utilize textual information more effectively, this work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of two modalities during pre-training and fine-tuning, sequentially. We use vq-wav2vec BERT as a speech encoder because it captures general and rich features. Furthermore, we improve the performance, especially in a low-resource scenario, with data augmentation methods by randomly masking spans of discrete audio tokens and contextualized hidden representations. Consequently, we push the state-of-the-art on the Fluent Speech Commands, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting. Throughout the ablation studies, we empirically verify that all used methods are crucial to the final performance, providing the best practice for spoken language understanding. Code to reproduce our results will be available upon publication.

* Preprint; 5 pages, 1 figure

Via

Access Paper or Ask Questions