Runlong Zhou

Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs

Jun 06, 2025

Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

Abstract:We consider the gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of $$\tilde{O}\left(\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)} +\sum_{\Delta_h(s,a)=0}\frac{ H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_{\mathrm{min}}} + SAH^4 (S \lor H) \right) \log K\right),$$ where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Here, $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s, a)$ represents the suboptimality gap and $\Delta_{\mathrm{min}} := \min_{\Delta_h (s,a) > 0} \Delta_h(s,a)$. The term $\mathtt{Var}_{\max}^{\text{c}}$ denotes the maximum conditional total variance, calculated as the maximum over all $(\pi, h, s)$ tuples of the expected total variance under policy $\pi$ conditioned on trajectories visiting state $s$ at step $h$. $\mathtt{Var}_{\max}^{\text{c}}$ characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gap and can be potentially adapted for other algorithms. To complement the study, we establish a lower bound of $$\Omega \left( \sum_{\Delta_h(s,a)>0} \frac{H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)}\cdot \log K\right),$$ demonstrating the necessity of dependence on $\mathtt{Var}_{\max}^{\text{c}}$ even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.

* 30 pages
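
To make the suboptimality gap in the abstract above concrete, here is a minimal sketch that computes $V_h^*$, $Q_h^*$, the gaps $V_h^*(s) - Q_h^*(s,a)$, and the smallest positive gap by backward induction. The two-step, two-state, two-action MDP and all function names are hypothetical and chosen only for illustration; this is not the MVP algorithm and says nothing about its regret.

```python
# Toy illustration only: this MDP is hypothetical and not taken from the paper.

def optimal_values(reward, transition):
    """Backward induction for a finite-horizon tabular MDP.

    reward[h][s][a] is the immediate reward; transition[h][s][a] is a list of
    next-state probabilities. Returns V[h][s] and Q[h][s][a].
    """
    H, S, A = len(reward), len(reward[0]), len(reward[0][0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H][s] = 0 by convention
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                expected_next = sum(
                    p * V[h + 1][s2] for s2, p in enumerate(transition[h][s][a])
                )
                Q[h][s][a] = reward[h][s][a] + expected_next
            V[h][s] = max(Q[h][s])
    return V, Q


def suboptimality_gaps(V, Q):
    """Gap of each (h, s, a): optimal value of s minus optimal Q-value of (s, a)."""
    return [
        [[V[h][s] - q for q in Q[h][s]] for s in range(len(Q[h]))]
        for h in range(len(Q))
    ]


# Hypothetical rewards and transitions (H = 2, S = 2, A = 2).
reward = [
    [[0.0, 0.25], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.5]],
]
transition = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]],
    [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]],
]

V, Q = optimal_values(reward, transition)
gaps = suboptimality_gaps(V, Q)
# Smallest strictly positive gap over all (h, s, a).
delta_min = min(g for step in gaps for row in step for g in row if g > 0)
```

Optimal actions have gap zero, so they fall into the second sum of the upper bound, which is scaled by the minimum positive gap instead of their own gap.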

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

May 26, 2025

Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

Abstract:We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.

* 30 pages, 5 figures

CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models

Apr 02, 2025

Runlong Zhou, Yi Zhang

Abstract:Language models often struggle with cross-mode knowledge retrieval -- the ability to access knowledge learned in one format (mode) when queried in another. We demonstrate that models trained on multiple data sources (e.g., Wikipedia and TinyStories) exhibit significantly reduced accuracy when retrieving knowledge in a format different from its original training mode. This paper quantitatively investigates this phenomenon through a controlled study of random token sequence memorization across different modes. We first explore dataset rewriting as a solution, revealing that effective cross-mode retrieval requires prohibitively extensive rewriting efforts that follow a sigmoid-like relationship. As an alternative, we propose CASCADE, a novel pretraining algorithm that uses cascading datasets with varying sequence lengths to capture knowledge at different scales. Our experiments demonstrate that CASCADE outperforms dataset rewriting approaches, even when compressed into a single model with a unified loss function. This work provides both qualitative evidence of cross-mode retrieval limitations and a practical solution to enhance language models' ability to access knowledge independently of its presentational format.

Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback

Mar 11, 2025

Runlong Zhou, Maryam Fazel, Simon S. Du

Abstract:Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This assumption fails to capture the non-transitive nature of populational human preferences. Nash learning from human feedback (NLHF), targeting non-transitive preferences, is a problem of computing the Nash equilibrium (NE) of the two-player constant-sum game defined by the human preference. We introduce Extragradient preference optimization (EGPO), a novel algorithm for NLHF achieving last-iterate linear convergence to the NE of KL-regularized games and polynomial convergence to the NE of original games, while being robust to noise. Unlike previous approaches that rely on nested optimization, we derive an equivalent implementation using gradients of an online variant of the identity preference optimization (IPO) loss, enabling more faithful implementation for neural networks. Our empirical evaluations demonstrate EGPO's superior performance over baseline methods when training for the same number of epochs, as measured by pairwise win-rates using the ground truth preference. These results validate both the theoretical strengths and practical advantages of EGPO for language model alignment with non-transitive human preferences.
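
The contrast between a transitive Bradley-Terry model and non-transitive preferences can be sketched with a small example. The code below is an illustration of the problem setting only, not of EGPO: the three-response preference matrix `P`, the reward scores, and the helper names are all hypothetical. It shows that Bradley-Terry probabilities derived from scalar scores are always consistently ordered, that a cyclic (rock-paper-scissors style) preference is not, and that the uniform mixture is a Nash equilibrium of the constant-sum game defined by that cyclic preference.

```python
import math


def bradley_terry(r_i, r_j):
    """Bradley-Terry probability that item i is preferred over item j."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))


def has_cycle(P):
    """True if some triple (i, j, k) satisfies i > j, j > k, and k > i in majority preference."""
    n = len(P)
    return any(
        P[i][j] > 0.5 and P[j][k] > 0.5 and P[k][i] > 0.5
        for i in range(n) for j in range(n) for k in range(n)
    )


def win_rate(P, pi, nu):
    """Expected probability that a response drawn from pi beats one drawn from nu."""
    return sum(
        pi[i] * P[i][j] * nu[j] for i in range(len(pi)) for j in range(len(nu))
    )


# Hypothetical scalar rewards induce a transitive preference matrix.
scores = [2.0, 1.0, 0.0]
P_bt = [[bradley_terry(ri, rj) for rj in scores] for ri in scores]

# Hypothetical cyclic preference: 0 beats 1, 1 beats 2, 2 beats 0.
# P[i][j] + P[j][i] = 1, so the induced two-player game is constant-sum.
P = [
    [0.5, 0.8, 0.2],
    [0.2, 0.5, 0.8],
    [0.8, 0.2, 0.5],
]

uniform = [1.0 / 3.0] * 3
pure = [[1.0 if i == k else 0.0 for i in range(3)] for k in range(3)]

# Win-rate of every pure deviation against the uniform mixture.
deviation_values = [win_rate(P, e, uniform) for e in pure]
```

Every pure deviation wins against the uniform mixture with probability 0.5 (up to floating-point rounding), which is exactly the value of the game, so neither player can gain by deviating. No single response is "best" here, which is why the target is an equilibrium mixture instead of a reward maximizer.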

The Crucial Role of Samplers in Online Direct Preference Optimization

Sep 29, 2024

Ruizhe Shi, Runlong Zhou, Simon S. Du

Abstract:Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the $\textit{optimization}$ properties, particularly the impact of samplers on its convergence rates, remain underexplored. In this paper, we provide a rigorous analysis of DPO's $\textit{convergence rates}$ with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $\textit{linear}$ convergence, while our proposed online sampler achieves $\textit{quadratic}$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and $\textit{logit mixing}$, demonstrating significant improvements over previous approaches. On the Safe-RLHF dataset, our method exhibits a $4.5$% improvement over vanilla DPO and a $3.0$% improvement over on-policy DPO; on Iterative-Prompt, our approach outperforms vanilla DPO, on-policy DPO, and Hybrid GSHF by over $4.2$%. Our results not only offer insights into the theoretical standing of DPO but also pave the way for potential algorithm designs in the future.

* 33 pages
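
For readers unfamiliar with the objective being analyzed, the following is a minimal sketch of the standard per-pair DPO loss as it is commonly written, $-\log \sigma\big(\beta[(\log \pi(y_w|x) - \log \pi_{\mathrm{ref}}(y_w|x)) - (\log \pi(y_l|x) - \log \pi_{\mathrm{ref}}(y_l|x))]\big)$. It does not implement the samplers or logit mixing studied in the paper; the function name, argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. beta=0.1 is an illustrative default, not a recommendation.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Raising the chosen response's log-probability relative to the reference lowers the loss.
loss_after_improvement = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

Which response pairs are fed into this loss is the sampler's job, and that choice is what the abstract identifies as driving the convergence rate.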

Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques

Sep 04, 2024

Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M. Kakade, Simon S. Du

Abstract:We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.

Reflect-RL: Two-Player Online RL Fine-Tuning for LMs

Feb 20, 2024

Runlong Zhou, Simon S. Du, Beibin Li

Abstract:As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works have attempted to directly train LMs within interactive decision-making environments. We aim to create an effective mechanism to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using online RL, where a frozen reflection model assists the policy model. To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we design single-prompt action enumeration and apply curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Test results indicate that GPT-2-xl after Reflect-RL also outperforms untuned pre-trained LMs, such as Mistral 7B.

* 25 pages, 13 figures

Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning

Oct 30, 2023

Zhaoyi Zhou, Chuning Zhu, Runlong Zhou, Qiwen Cui, Abhishek Gupta, Simon Shaolei Du

Abstract:Off-policy dynamic programming (DP) techniques such as $Q$-learning have proven to be important for solving sequential decision-making problems. However, in the presence of function approximation such algorithms are not guaranteed to converge, often diverging due to the absence of Bellman-completeness in the function classes considered, a crucial condition for the success of DP-based methods. In this paper, we show how off-policy learning techniques based on return-conditioned supervised learning (RCSL) are able to circumvent these challenges of Bellman completeness, converging under significantly more relaxed assumptions inherited from supervised learning. We prove there exists a natural environment in which if one uses a two-layer multilayer perceptron as the function approximator, the layer width needs to grow linearly with the state space size to satisfy Bellman-completeness while a constant layer width is enough for RCSL. These findings take a step towards explaining the superior empirical performance of RCSL methods compared to DP-based methods in environments with near-optimal datasets. Furthermore, in order to learn from sub-optimal datasets, we propose a simple framework called MBRCSL, granting RCSL methods the ability of dynamic programming to stitch together segments from distinct trajectories. MBRCSL leverages learned dynamics models and forward sampling to accomplish trajectory stitching while avoiding the need for Bellman completeness that plagues all dynamic programming algorithms. We provide both theoretical analysis and experimental evaluation to back these claims, and MBRCSL outperforms state-of-the-art model-free and model-based offline RL algorithms across several simulated robotics problems.

Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments

Jan 31, 2023

Runlong Zhou, Zihan Zhang, Simon S. Du

Abstract:We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques to show that this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.

* 43 pages, 1 figure

Understanding Curriculum Learning in Policy Optimization for Solving Combinatorial Optimization Problems

Feb 11, 2022

Runlong Zhou, Yuandong Tian, Yi Wu, Simon S. Du

Abstract:In recent years, reinforcement learning (RL) has shown impressive performance in finding strategic solutions for game environments, and has recently started to show promising results in solving combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study on why RL helps is still at an early stage. This paper presents the first systematic study on policy optimization methods for solving CO problems. We show that CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical combinatorial problem, the Secretary Problem, we formally prove that distribution shift is reduced exponentially with curriculum learning. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Secretary Problem and Online Knapsack to empirically verify our findings.

* 28 pages, 8 figures
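
As background on the Secretary Problem mentioned in the abstract above, here is a minimal sketch of the classical cutoff rule: observe the first `r` of `n` candidates without accepting, then accept the first candidate better than all seen so far. The code evaluates the textbook success probability $\frac{r}{n}\sum_{i=r+1}^{n}\frac{1}{i-1}$ exactly. This is the classical baseline policy, not the paper's policy-gradient or curriculum method, and the function names are hypothetical.

```python
from fractions import Fraction


def success_probability(n, r):
    """Exact probability that the cutoff rule with cutoff r picks the best of n candidates."""
    if r == 0:
        # With no observation phase the rule accepts the first candidate.
        return Fraction(1, n)
    return Fraction(r, n) * sum(Fraction(1, i - 1) for i in range(r + 1, n + 1))


def best_cutoff(n):
    """Cutoff in 0..n-1 that maximizes the success probability (smallest one on ties)."""
    return max(range(n), key=lambda r: (success_probability(n, r), -r))


# Small instances where the exact values are easy to check by hand.
p_three = success_probability(3, 1)   # (1/3) * (1 + 1/2)
p_four = success_probability(4, 1)    # (1/4) * (1 + 1/2 + 1/3)

# A larger instance; the optimum approaches 1/e as n grows.
p_hundred = float(success_probability(100, best_cutoff(100)))
```

Instances with small `n` are easy to solve exactly like this, which gives some intuition for why a curriculum over problem sizes is a natural way to build up a sampling policy for larger ones.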
