Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ziniu Hu

Celine

TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

Sep 18, 2025

Dan Zhang, Min Cai, Jonathan Li, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang

Figure 1 for TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

Figure 2 for TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

Figure 3 for TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

Figure 4 for TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

Abstract:Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences during training. This temporal-difference (TD) regularization produces smooth rewards and improves alignment with long-term objectives. Incorporating TDRM into the actor-critic style online RL loop yields consistent empirical gains. It is worth noting that TDRM is a supplement to verifiable reward methods, and both can be used in series. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL -- achieving comparable performance with just 2.5k data to what baseline methods require 50.1k data to attain -- and yield higher-quality language model policies on 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1,5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.

* 9 figures, 7 tables

Via

Access Paper or Ask Questions

TreeRL: LLM Reinforcement Learning with On-Policy Tree Search

Jun 13, 2025

Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, Yuxiao Dong

Abstract:Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for a separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLM. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.

* Accepted to ACL 2025 main conference

Via

Access Paper or Ask Questions

Self-Evolving Visual Concept Library using Vision-Language Critics

Mar 31, 2025

Atharva Sehgal, Patrick Yuan, Ziniu Hu, Yisong Yue, Jennifer J. Sun, Swarat Chaudhuri

Abstract:We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic's feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.

* CVPR camera ready

Via

Access Paper or Ask Questions

DataSciBench: An LLM Agent Benchmark for Data Science

Feb 19, 2025

Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue

Figure 1 for DataSciBench: An LLM Agent Benchmark for Data Science

Figure 2 for DataSciBench: An LLM Agent Benchmark for Data Science

Figure 3 for DataSciBench: An LLM Agent Benchmark for Data Science

Figure 4 for DataSciBench: An LLM Agent Benchmark for Data Science

Abstract:This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at https://github.com/THUDM/DataSciBench.

* 40 pages, 7 figures, 6 tables

Via

Access Paper or Ask Questions

QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Feb 04, 2025

Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang

Figure 1 for QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Figure 2 for QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Figure 3 for QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Figure 4 for QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Abstract:Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.

Via

Access Paper or Ask Questions

Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering

Nov 26, 2024

Awais Naeem, Tianhao Li, Huang-Ru Liao, Jiawei Xu, Aby M. Mathew, Zehao Zhu, Zhen Tan, Ajay Kumar Jaiswal, Raffi A. Salibian, Ziniu Hu(+2 more)

Figure 1 for Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering

Figure 2 for Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering

Figure 3 for Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering

Figure 4 for Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering

Abstract:Accurate diagnosis and prognosis assisted by pathology images are essential for cancer treatment selection and planning. Despite the recent trend of adopting deep-learning approaches for analyzing complex pathology images, they fall short as they often overlook the domain-expert understanding of tissue structure and cell composition. In this work, we focus on a challenging Open-ended Pathology VQA (PathVQA-Open) task and propose a novel framework named Path-RAG, which leverages HistoCartography to retrieve relevant domain knowledge from pathology images and significantly improves performance on PathVQA-Open. Admitting the complexity of pathology image analysis, Path-RAG adopts a human-centered AI approach by retrieving domain knowledge using HistoCartography to select the relevant patches from pathology images. Our experiments suggest that domain guidance can significantly boost the accuracy of LLaVA-Med from 38% to 47%, with a notable gain of 28% for H&E-stained pathology images in the PathVQA-Open dataset. For longer-form question and answer pairs, our model consistently achieves significant improvements of 32.5% in ARCH-Open PubMed and 30.6% in ARCH-Open Books on H\&E images. Our code and dataset is available here (https://github.com/embedded-robotics/path-rag).

Via

Access Paper or Ask Questions

PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Nov 24, 2024

Jonathan Light, Sixue Xing, Yuanzhe Liu, Weiqin Chen, Min Cai, Xiusi Chen, Guanzhi Wang, Wei Cheng, Yisong Yue, Ziniu Hu

Figure 1 for PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Figure 2 for PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Figure 3 for PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Figure 4 for PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Abstract:Effective extraction of the world knowledge in LLMs for complex decision-making tasks remains a challenge. We propose a framework PIANIST for decomposing the world model into seven intuitive components conducive to zero-shot LLM generation. Given only the natural language description of the game and how input observations are formatted, our method can generate a working world model for fast and efficient MCTS simulation. We show that our method works well on two different games that challenge the planning and decision making skills of the agent for both language and non-language based action taking, without any training on domain-specific training data or explicitly defined world model.

* Published at Language Gamification Workshop 2024 @ NeurIPS

Via

Access Paper or Ask Questions

Physics-Informed Regularization for Domain-Agnostic Dynamical System Modeling

Oct 08, 2024

Zijie Huang, Wanjia Zhao, Jingdong Gao, Ziniu Hu, Xiao Luo, Yadi Cao, Yuanzhou Chen, Yizhou Sun, Wei Wang

Figure 1 for Physics-Informed Regularization for Domain-Agnostic Dynamical System Modeling

Figure 2 for Physics-Informed Regularization for Domain-Agnostic Dynamical System Modeling

Figure 3 for Physics-Informed Regularization for Domain-Agnostic Dynamical System Modeling

Figure 4 for Physics-Informed Regularization for Domain-Agnostic Dynamical System Modeling

Abstract:Learning complex physical dynamics purely from data is challenging due to the intrinsic properties of systems to be satisfied. Incorporating physics-informed priors, such as in Hamiltonian Neural Networks (HNNs), achieves high-precision modeling for energy-conservative systems. However, real-world systems often deviate from strict energy conservation and follow different physical priors. To address this, we present a framework that achieves high-precision modeling for a wide range of dynamical systems from the numerical aspect, by enforcing Time-Reversal Symmetry (TRS) via a novel regularization term. It helps preserve energies for conservative systems while serving as a strong inductive bias for non-conservative, reversible systems. While TRS is a domain-specific physical prior, we present the first theoretical proof that TRS loss can universally improve modeling accuracy by minimizing higher-order Taylor terms in ODE integration, which is numerically beneficial to various systems regardless of their properties, even for irreversible systems. By integrating the TRS loss within neural ordinary differential equation models, the proposed model TREAT demonstrates superior performance on diverse physical systems. It achieves a significant 11.5% MSE improvement in a challenging chaotic triple-pendulum scenario, underscoring TREAT's broad applicability and effectiveness.

* Accepted to The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Via

Access Paper or Ask Questions

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Aug 20, 2024

Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu

Figure 1 for Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Figure 2 for Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Figure 3 for Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Figure 4 for Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Abstract:In this paper, we propose a new method Strategist that utilizes LLMs to acquire new skills for playing multi-agent games through a self-improvement process. Our method gathers quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection, which can then be used to learn high-level strategic skills such as how to evaluate states that guide the low-level execution.We showcase how our method can be used in both action planning and dialogue generation in the context of games, achieving good performance on both tasks. Specifically, we demonstrate that our method can help train agents with better performance than both traditional reinforcement learning-based approaches and other LLM-based skill learning approaches in games including the Game of Pure Strategy (GOPS) and The Resistance: Avalon.

* website: https://llm-strategist.github.io

Via

Access Paper or Ask Questions

Dataset Distillation for Offline Reinforcement Learning

Aug 01, 2024

Jonathan Light, Yuanzhe Liu, Ziniu Hu

Figure 1 for Dataset Distillation for Offline Reinforcement Learning

Figure 2 for Dataset Distillation for Offline Reinforcement Learning

Figure 3 for Dataset Distillation for Offline Reinforcement Learning

Figure 4 for Dataset Distillation for Offline Reinforcement Learning

Abstract:Offline reinforcement learning often requires a quality dataset that we can train a policy on. However, in many situations, it is not possible to get such a dataset, nor is it easy to train a policy to perform well in the actual environment given the offline data. We propose using data distillation to train and distill a better dataset which can then be used for training a better policy model. We show that our method is able to synthesize a dataset where a model trained on it achieves similar performance to a model trained on the full dataset or a model trained using percentile behavioral cloning. Our project site is available at $\href{https://datasetdistillation4rl.github.io}{\text{here}}$. We also provide our implementation at $\href{https://github.com/ggflow123/DDRL}{\text{this GitHub repository}}$.

* ICML 2024 DMLR Workshop

Via

Access Paper or Ask Questions