Picture for Hyungjoo Chae

Hyungjoo Chae

ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Add code
May 29, 2025
Viaarxiv icon

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Add code
May 21, 2025
Figure 1 for Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Figure 2 for Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Figure 3 for Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Figure 4 for Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Viaarxiv icon

Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization

Add code
May 19, 2025
Viaarxiv icon

Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Add code
Oct 17, 2024
Figure 1 for Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Figure 2 for Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Figure 3 for Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Figure 4 for Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Viaarxiv icon

Evaluating Robustness of Reward Models for Mathematical Reasoning

Add code
Oct 02, 2024
Figure 1 for Evaluating Robustness of Reward Models for Mathematical Reasoning
Figure 2 for Evaluating Robustness of Reward Models for Mathematical Reasoning
Figure 3 for Evaluating Robustness of Reward Models for Mathematical Reasoning
Figure 4 for Evaluating Robustness of Reward Models for Mathematical Reasoning
Viaarxiv icon

Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code

Add code
Sep 29, 2024
Figure 1 for Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
Figure 2 for Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
Figure 3 for Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
Figure 4 for Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
Viaarxiv icon

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Add code
Jun 20, 2024
Figure 1 for Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Figure 2 for Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Figure 3 for Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Figure 4 for Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Viaarxiv icon

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Add code
Jun 09, 2024
Figure 1 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Figure 2 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Figure 3 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Figure 4 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Viaarxiv icon

Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Add code
Apr 03, 2024
Figure 1 for Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Figure 2 for Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Figure 3 for Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Figure 4 for Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Viaarxiv icon

Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering

Add code
Mar 05, 2024
Viaarxiv icon