Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jaekyeom Kim

Process Reward Models That Think

Apr 23, 2025

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang

Figure 1 for Process Reward Models That Think

Figure 2 for Process Reward Models That Think

Figure 3 for Process Reward Models That Think

Figure 4 for Process Reward Models That Think

Abstract:Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.

Via

Access Paper or Ask Questions

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Apr 13, 2025

Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Figure 1 for MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Figure 2 for MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Figure 3 for MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Figure 4 for MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Abstract:Existing evaluation of large language model (LLM) agents on scientific discovery lacks objective baselines and metrics to assess the viability of their proposed methods. To address this issue, we introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Our benchmark highlights open research problems that demand novel methodologies, in contrast to recent benchmarks such as OpenAI's MLE-Bench (Chan et al., 2024) and METR's RE-Bench (Wijk et al., 2024), which focus on well-established research tasks that are largely solvable through sufficient engineering effort. Unlike prior work, e.g., AI Scientist (Lu et al., 2024b), which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with newly proposed rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB (Huang et al., 2024a)) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between the LLM-judged innovation and their actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, which is designed to continually grow with new ML competitions to encourage rigorous and objective evaluations of AI's research capabilities.

Via

Access Paper or Ask Questions

Do Not Trust Licenses You See -- Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing

Mar 04, 2025

Jaekyeom Kim, Sungryull Sohn, Gerrard Jeongwon Jo, Jihoon Choi, Kyunghoon Bae, Hwayoung Lee, Yongmin Park, Honglak Lee

Abstract:This paper argues that a dataset's legal risk cannot be accurately assessed by its license terms alone; instead, tracking dataset redistribution and its full lifecycle is essential. However, this process is too complex for legal experts to handle manually at scale. Tracking dataset provenance, verifying redistribution rights, and assessing evolving legal risks across multiple stages require a level of precision and efficiency that exceeds human capabilities. Addressing this challenge effectively demands AI agents that can systematically trace dataset redistribution, analyze compliance, and identify legal risks. We develop an automated data compliance system called NEXUS and show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts. Our massive legal analysis of 17,429 unique entities and 8,072 license terms using this approach reveals the discrepancies in legal rights between the original datasets before redistribution and their redistributed subsets, underscoring the necessity of the data lifecycle-aware compliance. For instance, we find that out of 2,852 datasets with commercially viable individual license terms, only 605 (21%) are legally permissible for commercialization. This work sets a new standard for AI data governance, advocating for a framework that systematically examines the entire lifecycle of dataset redistribution to ensure transparent, legal, and responsible dataset management.

Via

Access Paper or Ask Questions

Interactive and Expressive Code-Augmented Planning with Large Language Models

Nov 21, 2024

Anthony Z. Liu, Xinhe Wang, Jacob Sansom, Yao Fu, Jongwook Choi, Sungryull Sohn, Jaekyeom Kim, Honglak Lee

Figure 1 for Interactive and Expressive Code-Augmented Planning with Large Language Models

Figure 2 for Interactive and Expressive Code-Augmented Planning with Large Language Models

Figure 3 for Interactive and Expressive Code-Augmented Planning with Large Language Models

Figure 4 for Interactive and Expressive Code-Augmented Planning with Large Language Models

Abstract:Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making, but often struggle with complex, long-horizon planning tasks. Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance. These techniques include using variables (to track important information) and functions (to divide complex tasks into smaller re-usable sub-tasks). However, purely code-based approaches can be error-prone and insufficient for handling ambiguous or unstructured data. To address these challenges, we propose REPL-Plan, an LLM planning approach that is fully code-expressive (it can utilize all the benefits of code) while also being dynamic (it can flexibly adapt from errors and use the LLM for fuzzy situations). In REPL-Plan, an LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code, similar to language shells or interactive code notebooks, allowing the model to flexibly correct errors and handle tasks dynamically. We demonstrate that REPL-Plan achieves strong results across various planning domains compared to previous methods.

Via

Access Paper or Ask Questions

Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents

Oct 29, 2024

Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee

Figure 1 for Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents

Figure 2 for Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents

Figure 3 for Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents

Figure 4 for Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents

Abstract:In this paper, we introduce Auto-Intent, a method to adapt a pre-trained large language model (LLM) as an agent for a target domain without direct fine-tuning, where we empirically focus on web navigation tasks. Our approach first discovers the underlying intents from target domain demonstrations unsupervisedly, in a highly compact form (up to three words). With the extracted intents, we train our intent predictor to predict the next intent given the agent's past observations and actions. In particular, we propose a self-exploration approach where top-k probable intent predictions are provided as a hint to the pre-trained LLM agent, which leads to enhanced decision-making capabilities. Auto-Intent substantially improves the performance of GPT-{3.5, 4} and Llama-3.1-{70B, 405B} agents on the large-scale real-website navigation benchmarks from Mind2Web and online navigation tasks from WebArena with its cross-benchmark generalization from Mind2Web.

* EMNLP 2024 Findings

Via

Access Paper or Ask Questions

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Apr 26, 2024

Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Figure 1 for Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Figure 2 for Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Figure 3 for Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Figure 4 for Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Abstract:Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether smaller-size (<= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

Via

Access Paper or Ask Questions

AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

Mar 13, 2024

Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee

Figure 1 for AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

Figure 2 for AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

Figure 3 for AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

Figure 4 for AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

Abstract:The primary limitation of large language models (LLMs) is their restricted understanding of the world. This poses significant difficulties for LLM-based agents, particularly in domains where pre-trained LLMs lack sufficient knowledge. In this paper, we introduce a novel framework, called AutoGuide, that bridges the knowledge gap in pre-trained LLMs by leveraging implicit knowledge in offline experiences. Specifically, AutoGuide effectively extracts knowledge embedded in offline data by extracting a set of state-aware guidelines. Importantly, each state-aware guideline is expressed in concise natural language and follows a conditional structure, clearly describing the state where it is applicable. As such, the resulting guidelines enable a principled way to provide helpful knowledge pertinent to an agent's current decision-making process. We show that our approach outperforms competitive LLM-based baselines by a large margin in sequential decision-making benchmarks.

Via

Access Paper or Ask Questions

Lipschitz-constrained Unsupervised Skill Discovery

Feb 08, 2022

Seohong Park, Jongwook Choi, Jaekyeom Kim, Honglak Lee, Gunhee Kim

Figure 1 for Lipschitz-constrained Unsupervised Skill Discovery

Figure 2 for Lipschitz-constrained Unsupervised Skill Discovery

Figure 3 for Lipschitz-constrained Unsupervised Skill Discovery

Figure 4 for Lipschitz-constrained Unsupervised Skill Discovery

Abstract:We study the problem of unsupervised skill discovery, whose goal is to learn a set of diverse and useful skills with no external reward. There have been a number of skill discovery methods based on maximizing the mutual information (MI) between skills and states. However, we point out that their MI objectives usually prefer static skills to dynamic ones, which may hinder the application for downstream tasks. To address this issue, we propose Lipschitz-constrained Skill Discovery (LSD), which encourages the agent to discover more diverse, dynamic, and far-reaching skills. Another benefit of LSD is that its learned representation function can be utilized for solving goal-following downstream tasks even in a zero-shot manner - i.e., without further training or complex planning. Through experiments on various MuJoCo robotic locomotion and manipulation environments, we demonstrate that LSD outperforms previous approaches in terms of skill diversity, state space coverage, and performance on seven downstream tasks including the challenging task of following multiple goals on Humanoid. Our code and videos are available at https://shpark.me/projects/lsd/.

* Accepted to ICLR 2022

Via

Access Paper or Ask Questions

Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

Dec 01, 2021

Seohong Park, Jaekyeom Kim, Gunhee Kim

Figure 1 for Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

Figure 2 for Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

Figure 3 for Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

Figure 4 for Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

Abstract:In reinforcement learning, continuous time is often discretized by a time scale $\delta$, to which the resulting performance is known to be highly sensitive. In this work, we seek to find a $\delta$-invariant algorithm for policy gradient (PG) methods, which performs well regardless of the value of $\delta$. We first identify the underlying reasons that cause PG methods to fail as $\delta \to 0$, proving that the variance of the PG estimator can diverge to infinity in stochastic environments under a certain assumption of stochasticity. While durative actions or action repetition can be employed to have $\delta$-invariance, previous action repetition methods cannot immediately react to unexpected situations in stochastic environments. We thus propose a novel $\delta$-invariant method named Safe Action Repetition (SAR) applicable to any existing PG algorithm. SAR can handle the stochasticity of environments by adaptively reacting to changes in states during action repetition. We empirically show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches on eight MuJoCo environments with both deterministic and stochastic settings. Our code is available at https://vision.snu.ac.kr/projects/sar.

* Accepted to NeurIPS 2021

Via

Access Paper or Ask Questions

Unsupervised Skill Discovery with Bottleneck Option Learning

Jun 27, 2021

Jaekyeom Kim, Seohong Park, Gunhee Kim

Figure 1 for Unsupervised Skill Discovery with Bottleneck Option Learning

Figure 2 for Unsupervised Skill Discovery with Bottleneck Option Learning

Figure 3 for Unsupervised Skill Discovery with Bottleneck Option Learning

Figure 4 for Unsupervised Skill Discovery with Bottleneck Option Learning

Abstract:Having the ability to acquire inherent skills from environments without any external rewards or supervision like humans is an important problem. We propose a novel unsupervised skill discovery method named Information Bottleneck Option Learning (IBOL). On top of the linearization of environments that promotes more various and distant state transitions, IBOL enables the discovery of diverse skills. It provides the abstraction of the skills learned with the information bottleneck framework for the options with improved stability and encouraged disentanglement. We empirically demonstrate that IBOL outperforms multiple state-of-the-art unsupervised skill discovery methods on the information-theoretic evaluations and downstream tasks in MuJoCo environments, including Ant, HalfCheetah, Hopper and D'Kitty.

* Accepted to ICML 2021. Code at https://vision.snu.ac.kr/projects/ibol

Via

Access Paper or Ask Questions