Runlong Zhou

Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback
Dec 31, 2025

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
Nov 10, 2025

The Ramon Llull's Thinking Machine for Automated Ideation
Aug 28, 2025

Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
Jun 06, 2025

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
May 26, 2025

CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
Apr 02, 2025

Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback
Mar 11, 2025

The Crucial Role of Samplers in Online Direct Preference Optimization
Sep 29, 2024

Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques
Sep 04, 2024

Reflect-RL: Two-Player Online RL Fine-Tuning for LMs
Feb 20, 2024