Faguo Wu

Offline RL with Smooth OOD Generalization in Convex Hull and its Neighborhood

Jun 10, 2025

Qingmao Yao, Zhichao Lei, Tianyuan Chen, Ziyue Yuan, Xuefan Chen, Jianxiang Liu, Faguo Wu, Xiao Zhang

Abstract: Offline Reinforcement Learning (RL) struggles with distributional shifts, leading to $Q$-value overestimation for out-of-distribution (OOD) actions. Existing methods address this issue by imposing constraints; however, they often become overly conservative when evaluating OOD regions, which constrains $Q$-function generalization. This over-constraint issue results in poor $Q$-value estimation and hinders policy improvement. In this paper, we introduce a novel approach to achieve better $Q$-value estimation by enhancing $Q$-function generalization in OOD regions within the Convex Hull and its Neighborhood (CHN). Under the safety generalization guarantees of the CHN, we propose the Smooth Bellman Operator (SBO), which updates OOD $Q$-values by smoothing them with neighboring in-sample $Q$-values. We theoretically show that SBO approximates true $Q$-values for both in-sample and OOD actions within the CHN. Our practical algorithm, Smooth Q-function OOD Generalization (SQOG), empirically alleviates the over-constraint issue, achieving near-accurate $Q$-value estimation. On the D4RL benchmarks, SQOG outperforms existing state-of-the-art methods in both performance and computational efficiency.

* ICLR 2025
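
As a rough illustration of the smoothing idea described in this abstract, the sketch below regresses $Q$-values at perturbed actions toward $Q$-values at nearby dataset actions. It is not the authors' SBO or SQOG implementation: the Gaussian perturbation, the noise scale, the action bounds, the toy $Q$-function, and all function names are assumptions made for this example.

```python
import numpy as np


def sample_neighbor_actions(actions, noise_scale, rng, low=-1.0, high=1.0):
    """Perturb dataset actions to get nearby, possibly OOD, actions (assumed scheme)."""
    noise = rng.normal(0.0, noise_scale, size=actions.shape)
    return np.clip(actions + noise, low, high)


def smoothing_loss(q_fn, states, actions, neighbor_actions):
    """Mean squared gap between Q at neighbor actions and Q at in-sample actions.

    The in-sample Q-values serve as fixed regression targets here.
    """
    target = q_fn(states, actions)
    pred = q_fn(states, neighbor_actions)
    return float(np.mean((pred - target) ** 2))


def toy_q(states, actions):
    """Hypothetical Q-function, linear in state and action, for demonstration only."""
    return states.sum(axis=1) + 2.0 * actions.sum(axis=1)


rng = np.random.default_rng(0)
states = rng.normal(size=(8, 3))
actions = rng.uniform(-0.5, 0.5, size=(8, 2))
neighbors = sample_neighbor_actions(actions, noise_scale=0.1, rng=rng)
loss = smoothing_loss(toy_q, states, actions, neighbors)
```

In a gradient-based training loop, a term like this would typically be added to the usual Bellman loss, with the in-sample target excluded from the gradient; that arrangement is also an assumption here.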

Preference-Guided Reinforcement Learning for Efficient Exploration

Jul 09, 2024

Guojian Wang, Faguo Wu, Xiao Zhang, Tianyuan Chen, Xuyang Chen, Lin Zhao

Abstract: In this paper, we investigate preference-based reinforcement learning (PbRL) that allows reinforcement learning (RL) agents to learn from human feedback. This is particularly valuable when defining a fine-grained reward function is not feasible. However, this approach is inefficient and impractical for promoting deep exploration in hard-exploration tasks with long horizons and sparse rewards. To tackle this issue, we introduce LOPE: Learning Online with trajectory Preference guidancE, an end-to-end preference-guided RL framework that enhances exploration efficiency in hard-exploration tasks. Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance, avoiding learning a separate reward model from preferences. Specifically, LOPE includes a two-step sequential policy optimization process consisting of trust-region-based policy improvement and preference guidance steps. We reformulate preference guidance as a novel trajectory-wise state marginal matching problem that minimizes the maximum mean discrepancy distance between the preferred trajectories and the learned policy. Furthermore, we provide a theoretical analysis to characterize the performance improvement bound and evaluate LOPE's effectiveness. When assessed in various challenging hard-exploration environments, LOPE outperforms several state-of-the-art methods regarding convergence rate and overall performance. The code used in this study is available at https://github.com/buaawgj/LOPE.

* 13 pages, 17 figures

Learning Diverse Policies with Soft Self-Generated Guidance

Feb 07, 2024

Guojian Wang, Faguo Wu, Xiao Zhang, Jianxiang Liu

Abstract: Reinforcement learning (RL) with sparse and deceptive rewards is challenging because non-zero rewards are rarely obtained. Hence, the gradient calculated by the agent can be stochastic and lack valid information. Recent studies that utilize memory buffers of previous experiences can lead to a more efficient learning process. However, existing methods often require these experiences to be successful and may overly exploit them, which can cause the agent to adopt suboptimal behaviors. This paper develops an approach that uses diverse past trajectories for faster and more efficient online RL, even if these trajectories are suboptimal or not highly rewarded. The proposed algorithm combines a policy improvement step with an additional exploration step using offline demonstration data. The main contribution of this paper is that by regarding diverse past trajectories as guidance, instead of imitating them, our method directs its policy to follow and expand past trajectories while still being able to learn without rewards and approach optimality. Furthermore, a novel diversity measurement is introduced to maintain the team's diversity and regulate exploration. The proposed algorithm is evaluated on discrete and continuous control tasks with sparse and deceptive rewards. The experimental results indicate that our proposed algorithm significantly outperforms the existing baseline RL methods regarding diverse exploration and avoiding local optima.

* International Journal of Intelligent Systems, Volume 2023
* 23 pages, 19 figures

Trajectory-Oriented Policy Optimization with Sparse Rewards

Jan 04, 2024

Guojian Wang, Faguo Wu, Xiao Zhang

Abstract: Deep reinforcement learning (DRL) remains challenging in tasks with sparse rewards. These sparse rewards often only indicate whether the task is partially or fully completed, meaning that many exploration actions must be performed before the agent obtains useful feedback. Hence, most existing DRL algorithms fail to learn feasible policies within a reasonable time frame. To overcome this problem, we develop an approach that exploits offline demonstration trajectories for faster and more efficient online RL in sparse reward settings. Our key insight is that by regarding offline demonstration trajectories as guidance, instead of imitating them, our method learns a policy whose state-action visitation marginal distribution matches that of offline demonstrations. Specifically, we introduce a novel trajectory distance based on maximum mean discrepancy (MMD) and formulate policy optimization as a distance-constrained optimization problem. Then, we show that this distance-constrained optimization problem can be reduced to a policy-gradient algorithm with shaped rewards learned from offline demonstrations. The proposed algorithm is evaluated on extensive discrete and continuous control tasks with sparse and deceptive rewards. The experimental results indicate that our proposed algorithm is significantly better than the baseline methods regarding diverse exploration and learning the optimal policy.

* 5 pages, 7 figures
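
For readers unfamiliar with the MMD mentioned in this abstract, the following is a minimal sketch of the standard (biased) sample estimate of squared MMD between two sets of vectors, for example state-action pairs drawn from demonstrations and from the current policy. It does not reproduce the paper's trajectory distance; the Gaussian kernel, the bandwidth, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of x and rows of y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


rng = np.random.default_rng(0)
demo_samples = rng.normal(size=(50, 2))      # stand-in for demonstration data
policy_samples = demo_samples + 3.0          # stand-in for a shifted policy distribution
same = mmd_squared(demo_samples, demo_samples)
shifted = mmd_squared(demo_samples, policy_samples)
```

The estimate is close to zero when the two sample sets coincide and grows as they move apart, which is what makes it usable as a distance term in a constrained objective.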

Policy Optimization with Smooth Guidance Rewards Learned from Sparse-Reward Demonstrations

Dec 30, 2023

Guojian Wang, Faguo Wu, Xiao Zhang, Tianyuan Chen

Abstract: The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have utilized temporal credit assignment (CA) to achieve impressive results in multiple hard tasks. However, many CA methods relied on complex architectures or introduced sensitive hyperparameters to estimate the impact of state-action pairs. Meanwhile, CA methods are feasible only if trajectories with sparse rewards can be obtained, which can be troublesome in sparse-reward environments with large state spaces. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG) that leverages a small set of sparse-reward demonstrations to make reliable and effective long-term credit assignments while efficiently facilitating exploration. The key idea is that the relative impact of state-action pairs can be indirectly estimated using offline demonstrations rather than directly leveraging the sparse reward trajectories generated by the agent. Specifically, we first obtain the trajectory importance by considering both the trajectory-level distance to demonstrations and the returns of the relevant trajectories. Then, the guidance reward is calculated for each state-action pair by smoothly averaging the importance of the trajectories through it, merging the demonstration's distribution and reward information. We theoretically analyze the performance improvement bound caused by smooth guidance rewards and derive a new worst-case lower bound on the performance improvement. Extensive results demonstrate POSG's significant advantages in control performance and convergence speed compared to benchmark DRL algorithms. Notably, the specific metrics and quantifiable results are investigated to demonstrate the superiority of POSG.

* 31 pages, 23 figures

Adaptive trajectory-constrained exploration strategy for deep reinforcement learning

Dec 27, 2023

Guojian Wang, Faguo Wu, Xiao Zhang, Ning Guo, Zhiming Zheng

Abstract: Deep reinforcement learning (DRL) faces significant challenges in addressing the hard-exploration problems in tasks with sparse or deceptive rewards and large state spaces. These challenges severely limit the practical application of DRL. Most previous exploration methods relied on complex architectures to estimate state novelty or introduced sensitive hyperparameters, resulting in instability. To mitigate these issues, we propose an efficient adaptive trajectory-constrained exploration strategy for DRL. The proposed method guides the policy of the agent away from suboptimal solutions by leveraging incomplete offline demonstrations as references. This approach gradually expands the exploration scope of the agent and strives for optimality in a constrained optimization manner. Additionally, we introduce a novel policy-gradient-based optimization algorithm that utilizes adaptively clipped trajectory-distance rewards for both single- and multi-agent reinforcement learning. We provide a theoretical analysis of our method, including a deduction of the worst-case approximation error bounds, highlighting the validity of our approach for enhancing exploration. To evaluate the effectiveness of the proposed method, we conducted experiments on two large 2D grid world mazes and several MuJoCo tasks. The extensive experimental results demonstrate the significant advantages of our method in achieving temporally extended exploration and avoiding myopic and suboptimal behaviors in both single- and multi-agent settings. Notably, the specific metrics and quantifiable results further support these findings. The code used in the study is available at https://github.com/buaawgj/TACE.

* Knowledge-Based Systems 285 (2024) 111334
* 35 pages, 36 figures; accepted by Knowledge-Based Systems, not published

PINF: Continuous Normalizing Flows for Physics-Constrained Deep Learning

Sep 26, 2023

Feng Liu, Faguo Wu, Xiao Zhang

Abstract: The normalization constraint on probability density poses a significant challenge for solving the Fokker-Planck equation. Normalizing Flow, an invertible generative model, leverages the change of variables formula to ensure probability density conservation and enable the learning of complex data distributions. In this paper, we introduce Physics-Informed Normalizing Flows (PINF), a novel extension of continuous normalizing flows, incorporating diffusion through the method of characteristics. Our method, which is mesh-free and causality-free, can efficiently solve high-dimensional time-dependent and steady-state Fokker-Planck equations.

* 12 pages, 3 figures
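
For reference, the change of variables formula mentioned in this abstract can be written as follows. These are the standard identities for discrete and continuous normalizing flows, not the PINF formulation itself; the symbols $f$, $x$, $z$, and $p$ are notation chosen here, and how PINF incorporates diffusion is not reproduced.

```latex
% Discrete flow: an invertible, differentiable map z = f(x) with base density p_Z
p_X(x) = p_Z\bigl(f(x)\bigr)\,\left|\det \frac{\partial f(x)}{\partial x}\right|

% Continuous flow: z(t) evolves under dz/dt = f(z(t), t), and the log-density
% along the path follows the instantaneous change of variables
\frac{\mathrm{d}z(t)}{\mathrm{d}t} = f\bigl(z(t), t\bigr), \qquad
\frac{\mathrm{d}\log p\bigl(z(t)\bigr)}{\mathrm{d}t}
  = -\operatorname{tr}\!\left(\frac{\partial f}{\partial z(t)}\right)
```

Because both forms track how a density transforms under an invertible map, the transformed density integrates to one whenever the base density does, which is the conservation property the abstract refers to.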
