Shenghua Wan

Reward Models in Deep Reinforcement Learning: A Survey

Jun 18, 2025

Rui Yu, Shenghua Wan, Yucen Wang, Chen-Xiao Gao, Le Gan, Zongzhang Zhang, De-Chuan Zhan

Abstract: In reinforcement learning (RL), agents continually interact with the environment and use the feedback to refine their behavior. To guide policy optimization, reward models are introduced as proxies of the desired objectives, such that when the agent maximizes the accumulated reward, it also fulfills the task designer's intentions. Recently, researchers in both academia and industry have devoted significant attention to developing reward models that not only align closely with the true objectives but also facilitate policy optimization. In this survey, we provide a comprehensive review of reward modeling techniques within the deep RL literature. We begin by outlining the background and preliminaries in reward modeling. Next, we present an overview of recent reward modeling approaches, categorizing them based on the source, the mechanism, and the learning paradigm. Building on this understanding, we discuss various applications of these reward modeling techniques and review methods for evaluating reward models. Finally, we conclude by highlighting promising research directions in reward modeling. Altogether, this survey covers both established and emerging methods, filling the gap left by the lack of a systematic review of reward models in the current literature.

* IJCAI 2025 Survey Track (To Appear)

SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets

Jun 13, 2024

Shenghua Wan, Ziyuan Chen, Le Gan, Shuai Feng, De-Chuan Zhan

Abstract: Model-based offline reinforcement learning (RL) is a promising approach that leverages existing data effectively in many real-world applications, especially those involving high-dimensional inputs like images and videos. To alleviate the distribution shift issue in offline RL, existing model-based methods rely heavily on the uncertainty of the learned dynamics. However, the model uncertainty estimate becomes significantly biased when observations contain complex distractors with non-trivial dynamics. To address this challenge, we propose a new approach, Separated Model-based Offline Policy Optimization (SeMOPO), which decomposes latent states into endogenous and exogenous parts via conservative sampling and estimates model uncertainty on the endogenous states only. We provide a theoretical guarantee on the model uncertainty and a performance bound for SeMOPO. To assess its efficacy, we construct the Low-Quality Vision Deep Data-Driven Datasets for RL (LQV-D4RL), where the data are collected by non-expert policies and the observations include moving distractors. Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method. The project website is https://sites.google.com/view/semopo.

* 23 pages, 10 figures

SENSOR: Imitate Third-Person Expert's Behaviors via Active Sensoring

Apr 04, 2024

Kaichen Huang, Minghao Shao, Shenghua Wan, Hai-Hang Sun, Shuai Feng, Le Gan, De-Chuan Zhan

Abstract: In many real-world visual Imitation Learning (IL) scenarios, there is a misalignment between the agent's and the expert's perspectives, which can cause imitation to fail. Previous methods have generally addressed this problem through domain alignment, which incurs extra computation and storage costs, and these methods fail to handle the hard cases where the viewpoint gap is too large. To alleviate these problems, we introduce active sensoring in the visual IL setting and propose a model-based SENSory imitatOR (SENSOR) to automatically change the agent's perspective to match the expert's. SENSOR jointly learns a world model to capture the dynamics of latent states, a sensor policy to control the camera, and a motor policy to control the agent. Experiments on visual locomotion tasks show that SENSOR can efficiently simulate the expert's perspective and strategy, and outperforms most baseline methods.

DIDA: Denoised Imitation Learning based on Domain Adaptation

Apr 04, 2024

Kaichen Huang, Hai-Hang Sun, Shenghua Wan, Minghao Shao, Shuai Feng, Le Gan, De-Chuan Zhan

Abstract: Imitating skills from low-quality datasets, such as sub-optimal demonstrations and observations with distractors, is common in real-world applications. In this work, we focus on the problem of Learning from Noisy Demonstrations (LND), where the imitator is required to learn from data with noise that often occurs during data collection or transmission. Previous IL methods improve the robustness of learned policies by injecting adversarially learned Gaussian noise into pure expert data or by utilizing additional ranking information, but they may fail in the LND setting. To alleviate these problems, we propose Denoised Imitation learning based on Domain Adaptation (DIDA), which designs two discriminators to distinguish the noise level and expertise level of data, helping a feature encoder learn task-related but domain-agnostic representations. Experimental results on MuJoCo demonstrate that DIDA can successfully handle challenging imitation tasks from demonstrations with various types of noise, outperforming most baseline methods.

AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors

Mar 15, 2024

Yucen Wang, Shenghua Wan, Le Gan, Shuai Feng, De-Chuan Zhan

Abstract: Model-based methods have significantly contributed to distinguishing task-irrelevant distractors for visual control. However, prior research has primarily focused on heterogeneous distractors like noisy background videos, leaving homogeneous distractors that closely resemble controllable agents largely unexplored, which poses significant challenges to existing methods. To tackle this problem, we propose the Implicit Action Generator (IAG) to learn the implicit actions of visual distractors, and present a new algorithm named implicit Action-informed Diverse visual Distractors Distinguisher (AD3), which leverages the actions inferred by IAG to train separated world models. Implicit actions effectively capture the behavior of background distractors, aiding in distinguishing the task-irrelevant components, so that the agent can optimize the policy within the task-relevant state space. Our method achieves superior performance on various visual control tasks featuring both heterogeneous and homogeneous distractors. The indispensable role of implicit actions learned by IAG is also empirically validated.

SeMAIL: Eliminating Distractors in Visual Imitation via Separated Models

Jun 19, 2023

Shenghua Wan, Yucen Wang, Minghao Shao, Ruying Chen, De-Chuan Zhan

Abstract: Model-based imitation learning (MBIL) is a popular reinforcement learning method that improves sample efficiency on high-dimensional input sources, such as images and videos. Existing algorithms that follow the conventions of MBIL research are easily deceived by task-irrelevant information, especially moving distractors in videos. To tackle this problem, we propose a new algorithm, Separated Model-based Adversarial Imitation Learning (SeMAIL), which decouples the environment dynamics into two parts by task-relevant dependency, as determined by agent actions, and trains them separately. In this way, the agent can imagine its trajectories and imitate the expert behavior efficiently in the task-relevant state space. Our method achieves near-expert performance on various visual control tasks with complex observations, as well as on more challenging tasks whose backgrounds differ from those in the expert observations.

* 18 pages, 7 figures
