Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Janghoon Han

DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

Dec 19, 2025

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee

Abstract:As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimension) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.

* Work in progress

Via

Access Paper or Ask Questions

KL Penalty Control via Perturbation for Direct Preference Optimization

Feb 18, 2025

Sangkyu Lee, Janghoon Han, Hosung Song, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

Figure 1 for KL Penalty Control via Perturbation for Direct Preference Optimization

Figure 2 for KL Penalty Control via Perturbation for Direct Preference Optimization

Figure 3 for KL Penalty Control via Perturbation for Direct Preference Optimization

Figure 4 for KL Penalty Control via Perturbation for Direct Preference Optimization

Abstract:Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods try to turn this static KL penalty into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$-DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of logits as a preference model under the perturbation of $\beta$ during training by simply reusing the logit of the current policy and the reference policy. Experimental results show that $\varepsilon$-DPO outperforms existing direct alignment algorithms and KL penalty relaxation methods on general chatbot benchmarks, highlighting the significance of adaptive KL penalty relaxation at the instance-level in DPO.

* Preprint; Under review

Via

Access Paper or Ask Questions

Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning

Jun 13, 2024

Janghoon Han, Changho Lee, Joongbo Shin, Stanley Jungkyu Choi, Honglak Lee, Kynghoon Bae

Abstract:Instruction tuning has emerged as a powerful technique, significantly boosting zero-shot performance on unseen tasks. While recent work has explored cross-lingual generalization by applying instruction tuning to multilingual models, previous studies have primarily focused on English, with a limited exploration of non-English tasks. For an in-depth exploration of cross-lingual generalization in instruction tuning, we perform instruction tuning individually for two distinct language meta-datasets. Subsequently, we assess the performance on unseen tasks in a language different from the one used for training. To facilitate this investigation, we introduce a novel non-English meta-dataset named "KORANI" (Korean Natural Instruction), comprising 51 Korean benchmarks. Moreover, we design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference within the cross-lingual setting. Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean, outperforming baseline by average scores of 20.7\% and 13.6\%, respectively. Remarkably, these enhancements are comparable to those achieved by monolingual instruction tuning and even surpass them in some tasks. The result underscores the significance of relevant data acquisition across languages over linguistic congruence with unseen tasks during instruction tuning.

* Findings of ACL 2024 (Camera-ready), by Janghoon Han and Changho Lee, with equal contribution

Via

Access Paper or Ask Questions

Instruction Matters, a Simple yet Effective Task Selection Approach in Instruction Tuning for Specific Tasks

Apr 25, 2024

Changho Lee, Janghoon Han, Seonghyeon Ye, Stanley Jungkyu Choi, Honglak Lee, Kyunghoon Bae

Figure 1 for Instruction Matters, a Simple yet Effective Task Selection Approach in Instruction Tuning for Specific Tasks

Figure 2 for Instruction Matters, a Simple yet Effective Task Selection Approach in Instruction Tuning for Specific Tasks

Figure 3 for Instruction Matters, a Simple yet Effective Task Selection Approach in Instruction Tuning for Specific Tasks

Figure 4 for Instruction Matters, a Simple yet Effective Task Selection Approach in Instruction Tuning for Specific Tasks

Abstract:Instruction tuning has shown its ability to not only enhance zero-shot generalization across various tasks but also its effectiveness in improving the performance of specific tasks. A crucial aspect in instruction tuning for a particular task is a strategic selection of related tasks that offer meaningful supervision, thereby enhancing efficiency and preventing performance degradation from irrelevant tasks. Our research reveals that leveraging instruction information \textit{alone} enables the identification of pertinent tasks for instruction tuning. This approach is notably simpler compared to traditional methods that necessitate complex measurements of pairwise transferability between tasks or the creation of data samples for the target task. Furthermore, by additionally learning the unique instructional template style of the meta-dataset, we observe an improvement in task selection accuracy, which contributes to enhanced overall performance. Experimental results demonstrate that training on a small set of tasks, chosen solely based on the instructions, leads to substantial performance improvements on benchmarks like P3, Big-Bench, NIV2, and Big-Bench Hard. Significantly, these improvements exceed those achieved by prior task selection methods, highlighting the efficacy of our approach.

* 21 pages, 6 figures, 16 tables

Via

Access Paper or Ask Questions

External Knowledge Selection with Weighted Negative Sampling in Knowledge-grounded Task-oriented Dialogue Systems

Sep 06, 2022

Janghoon Han, Joongbo Shin, Hosung Song, Hyunjik Jo, Gyeonghun Kim, Yireun Kim, Stanley Jungkyu Choi

Figure 1 for External Knowledge Selection with Weighted Negative Sampling in Knowledge-grounded Task-oriented Dialogue Systems

Figure 2 for External Knowledge Selection with Weighted Negative Sampling in Knowledge-grounded Task-oriented Dialogue Systems

Figure 3 for External Knowledge Selection with Weighted Negative Sampling in Knowledge-grounded Task-oriented Dialogue Systems

Figure 4 for External Knowledge Selection with Weighted Negative Sampling in Knowledge-grounded Task-oriented Dialogue Systems

Abstract:Constructing a robust dialogue system on spoken conversations bring more challenge than written conversation. In this respect, DSTC10-Track2-Task2 is proposed, which aims to build a task-oriented dialogue (TOD) system incorporating unstructured external knowledge on a spoken conversation, extending DSTC9-Track1. This paper introduces our system containing four advanced methods: data construction, weighted negative sampling, post-training, and style transfer. We first automatically construct a large training data because DSTC10-Track2 does not release the official training set. For the knowledge selection task, we propose weighted negative sampling to train the model more fine-grained manner. We also employ post-training and style transfer for the response generation task to generate an appropriate response with a similar style to the target response. In the experiment, we investigate the effect of weighted negative sampling, post-training, and style transfer. Our model ranked 7 out of 16 teams in the objective evaluation and 6 in human evaluation.

* 7page, DSTC10-Track2-task2

Via

Access Paper or Ask Questions

TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

Apr 29, 2022

Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Minjoon Seo

Figure 1 for TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

Figure 2 for TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

Figure 3 for TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

Figure 4 for TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

Abstract:Language Models (LMs) become outdated as the world changes; they often fail to perform tasks requiring recent factual information which was absent or different during training, a phenomenon called temporal misalignment. This is especially a challenging problem because the research community still lacks a coherent dataset for assessing the adaptability of LMs to frequently-updated knowledge corpus such as Wikipedia. To this end, we introduce TemporalWiki, a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark hence allows researchers to periodically track an LM's ability to retain previous knowledge and acquire updated/new knowledge at each point in time. We also find that training an LM on the diff data through continual learning methods achieves similar or better perplexity than on the entire snapshot in our benchmark with 12 times less computational cost, which verifies that factual knowledge in LMs can be safely updated with minimal training data via continual learning. The dataset and the code are available at https://github.com/joeljang/temporalwiki .

Via

Access Paper or Ask Questions

Towards Continual Knowledge Learning of Language Models

Oct 26, 2021

Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, Minjoon Seo

Figure 1 for Towards Continual Knowledge Learning of Language Models

Figure 2 for Towards Continual Knowledge Learning of Language Models

Figure 3 for Towards Continual Knowledge Learning of Language Models

Figure 4 for Towards Continual Knowledge Learning of Language Models

Abstract:Large Language Models (LMs) are known to encode world knowledge in their parameters as they pretrain on a vast amount of web corpus, which is often utilized for performing knowledge-dependent downstream tasks such as question answering, fact-checking, and open dialogue. In real-world scenarios, the world knowledge stored in the LMs can quickly become outdated as the world changes, but it is non-trivial to avoid catastrophic forgetting and reliably acquire new knowledge while preserving invariant knowledge. To push the community towards better maintenance of ever-changing LMs, we formulate a new continual learning (CL) problem called Continual Knowledge Learning (CKL). We construct a new benchmark and metric to quantify the retention of time-invariant world knowledge, the update of outdated knowledge, and the acquisition of new knowledge. We adopt applicable recent methods from literature to create several strong baselines. Through extensive experiments, we find that CKL exhibits unique challenges that are not addressed in previous CL setups, where parameter expansion is necessary to reliably retain and learn knowledge simultaneously. By highlighting the critical causes of knowledge forgetting, we show that CKL is a challenging and important problem that helps us better understand and train ever-changing LMs. The benchmark datasets, evaluation script, and baseline code to reproduce our results are available at https://github.com/joeljang/continual-knowledge-learning.

Via

Access Paper or Ask Questions