Paul Weng

Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning

May 20, 2025

Yunpeng Jiang, Jianshu Hu, Paul Weng, Yutong Ban

Abstract:Symmetry is pervasive in robotics and has been widely exploited to improve sample efficiency in deep reinforcement learning (DRL). However, existing approaches primarily focus on spatial symmetries, such as reflection, rotation, and translation, while largely neglecting temporal symmetries. To address this gap, we explore time reversal symmetry, a form of temporal symmetry commonly found in robotics tasks such as door opening and closing. We propose Time Reversal symmetry enhanced Deep Reinforcement Learning (TR-DRL), a framework that combines trajectory reversal augmentation and time reversal guided reward shaping to efficiently solve temporally symmetric tasks. Our method generates reversed transitions from fully reversible transitions, identified by a proposed dynamics-consistent filter, to augment the training data. For partially reversible transitions, we apply reward shaping to guide learning, according to successful trajectories from the reversed task. Extensive experiments on the Robosuite and MetaWorld benchmarks demonstrate that TR-DRL is effective in both single-task and multi-task settings, achieving higher sample efficiency and stronger final performance compared to baseline methods.
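
As a rough illustration of the trajectory reversal augmentation described in this abstract, the sketch below reverses transitions and keeps only those that pass a dynamics-consistency check. Everything here is an assumption made for illustration, not the authors' implementation: the names `reverse_transition`, `is_consistent`, `augment`, and `toy_dynamics` are hypothetical, the reversed action is taken to be the negated action (plausible only for displacement-style actions), the filter compares against a known toy dynamics function rather than a learned model, and rewards are omitted.

```python
from typing import Callable, List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Transition = Tuple[State, Action, State]  # (state, action, next_state)
Dynamics = Callable[[State, Action], State]


def reverse_transition(t: Transition) -> Transition:
    """Swap start and end states and negate the action (illustrative assumption)."""
    s, a, s_next = t
    return (s_next, tuple(-x for x in a), s)


def is_consistent(t: Transition, dynamics: Dynamics, tol: float = 1e-3) -> bool:
    """Accept a transition only if the dynamics reproduce its next state."""
    s, a, s_next = t
    predicted = dynamics(s, a)
    return max(abs(p - q) for p, q in zip(predicted, s_next)) <= tol


def augment(transitions: List[Transition], dynamics: Dynamics,
            tol: float = 1e-3) -> List[Transition]:
    """Return the original transitions plus every reversed one that passes the filter."""
    augmented = list(transitions)
    for t in transitions:
        reversed_t = reverse_transition(t)
        if is_consistent(reversed_t, dynamics, tol):
            augmented.append(reversed_t)
    return augmented


def toy_dynamics(s: State, a: Action) -> State:
    """Hypothetical 1-D integrator whose position is clipped to [0, 1]."""
    return tuple(min(1.0, max(0.0, x + u)) for x, u in zip(s, a))


data = [
    ((0.2,), (0.3,), (0.5,)),  # free motion: reversible
    ((0.9,), (0.3,), (1.0,)),  # hits the upper limit: not reversible
]
augmented = augment(data, toy_dynamics)
```

In this toy setting the first transition is reversible, so its reversed copy is added, while the second is clipped at the boundary and its reversal is rejected by the filter.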

ASAP: Learning Generalizable Online Bin Packing via Adaptive Selection After Pruning

Jan 29, 2025

Han Fang, Paul Weng, Yutong Ban

Abstract:Recently, deep reinforcement learning (DRL) has achieved promising results in solving online 3D Bin Packing Problems (3D-BPP). However, these DRL-based policies may perform poorly on new instances due to distribution shift. Besides generalization, we also consider adaptation, completely overlooked by previous work, which aims at rapidly finetuning these policies to a new test distribution. To tackle both generalization and adaptation issues, we propose Adaptive Selection After Pruning (ASAP), which decomposes a solver's decision-making into two policies, one for pruning and one for selection. The role of the pruning policy is to remove inherently bad actions, which allows the selection policy to choose among the remaining most valuable actions. To learn these policies, we propose a training scheme based on a meta-learning phase of both policies followed by a finetuning phase of the sole selection policy to rapidly adapt it to a test distribution. Our experiments demonstrate that ASAP exhibits excellent generalization and adaptation capabilities on in-distribution and out-of-distribution instances in both discrete and continuous setups.
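
The two-policy decomposition in this abstract (prune first, then select among what remains) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `prune_then_select`, the `keep_ratio` parameter, and the top-k pruning rule are hypothetical, and the score lists stand in for the outputs of the pruning and selection policies, whose actual form is not given here.

```python
from typing import Sequence


def prune_then_select(prune_scores: Sequence[float],
                      select_scores: Sequence[float],
                      keep_ratio: float = 0.5) -> int:
    """Pick an action index in two stages.

    Stage 1 keeps the actions with the highest pruning scores (assumed rule:
    keep the top `keep_ratio` fraction, at least one action).
    Stage 2 returns the kept action with the highest selection score.
    """
    n = len(prune_scores)
    k = max(1, int(n * keep_ratio))
    kept = sorted(range(n), key=lambda i: prune_scores[i], reverse=True)[:k]
    return max(kept, key=lambda i: select_scores[i])


# Action 1 has the best selection score but is pruned away in stage 1.
chosen = prune_then_select(
    prune_scores=[0.9, 0.1, 0.8, 0.2],
    select_scores=[0.3, 0.99, 0.7, 0.5],
)
```

Keeping the two stages separate is what would let only the selection scores be re-tuned on a new distribution while the pruning stage stays fixed, which mirrors the finetuning phase the abstract mentions.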

Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data

Jan 13, 2025

Shilong Deng, Zetao Zheng, Hongcai He, Paul Weng, Jie Shao

Abstract:A major challenge in Reinforcement Learning (RL) is the difficulty of learning an optimal policy from sparse rewards. Prior works enhance online RL with conventional Imitation Learning (IL) via a handcrafted auxiliary objective, at the cost of restricting the RL policy to be sub-optimal when the offline data is generated by a non-expert policy. Instead, to better leverage valuable information in offline data, we develop Generalized Imitation Learning from Demonstration (GILD), which meta-learns an objective that distills knowledge from offline data and instills intrinsic motivation towards the optimal policy. Distinct from prior works that are exclusive to a specific RL algorithm, GILD is a flexible module intended for diverse vanilla off-policy RL algorithms. In addition, GILD introduces no domain-specific hyperparameter and minimal increase in computational cost. In four challenging MuJoCo tasks with sparse rewards, we show that three RL algorithms enhanced with GILD significantly outperform state-of-the-art methods.

* Accepted by AAAI 2025 (this version includes supplementary material)

Imitation Learning from Suboptimal Demonstrations via Meta-Learning An Action Ranker

Dec 28, 2024

Jiangdong Fan, Hongcai He, Paul Weng, Hui Xu, Jie Shao

Abstract:A major bottleneck in imitation learning is the requirement of a large number of expert demonstrations, which can be expensive or inaccessible. Learning from supplementary demonstrations without strict quality requirements has emerged as a powerful paradigm to address this challenge. However, previous methods often fail to fully utilize their potential by discarding non-expert data. Our key insight is that even demonstrations that fall outside the expert distribution but outperform the learned policy can enhance policy performance. To utilize this potential, we propose a novel approach named imitation learning via meta-learning an action ranker (ILMAR). ILMAR implements weighted behavior cloning (weighted BC) on a limited set of expert demonstrations along with supplementary demonstrations. It utilizes the functional of the advantage function to selectively integrate knowledge from the supplementary demonstrations. To make more effective use of supplementary demonstrations, we introduce a meta-goal in ILMAR to optimize the functional of the advantage function by explicitly minimizing the distance between the current policy and the expert policy. Comprehensive experiments on a wide range of tasks demonstrate that ILMAR significantly outperforms previous methods in handling suboptimal demonstrations. Code is available at https://github.com/F-GOD6/ILMAR.

State-Novelty Guided Action Persistence in Deep Reinforcement Learning

Sep 09, 2024

Jianshu Hu, Paul Weng, Yutong Ban

Abstract:While a powerful and promising approach, deep reinforcement learning (DRL) still suffers from sample inefficiency, which can be notably improved by resorting to more sophisticated techniques to address the exploration-exploitation dilemma. One such technique relies on action persistence (i.e., repeating an action over multiple steps). However, previous work exploiting action persistence either applies a fixed strategy or learns additional value functions (or policy) for selecting the repetition number. In this paper, we propose a novel method to dynamically adjust the action persistence based on the current exploration status of the state space. In such a way, our method does not require training of additional value functions or policy. Moreover, the use of a smooth scheduling of the repeat probability allows a more effective balance between exploration and exploitation. Furthermore, our method can be seamlessly integrated into various basic exploration strategies to incorporate temporal persistence. Finally, extensive experiments on different DMControl tasks demonstrate that our state-novelty guided action persistence method significantly improves the sample efficiency.
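
To make the idea of novelty-dependent action persistence concrete, here is a minimal sketch. It is not the method from the paper: the class `PersistentExplorer`, the visit-count novelty measure `1 / sqrt(1 + n)`, the `max_repeat_prob` parameter, and the choice to repeat actions more often in rarely visited states are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Optional


class PersistentExplorer:
    """Wraps a base policy and sometimes repeats the previous action."""

    def __init__(self, base_policy: Callable[[Hashable], int],
                 max_repeat_prob: float = 0.9, seed: int = 0) -> None:
        self.base_policy = base_policy
        self.max_repeat_prob = max_repeat_prob
        self.counts: Counter = Counter()  # visit count per (discretized) state
        self.rng = random.Random(seed)
        self.last_action: Optional[int] = None

    def repeat_prob(self, state_key: Hashable) -> float:
        """Repeat probability shrinks smoothly as the state is visited more often."""
        novelty = (1 + self.counts[state_key]) ** -0.5
        return self.max_repeat_prob * novelty

    def act(self, state_key: Hashable) -> int:
        """Repeat the last action with the novelty-based probability, else query the policy."""
        p = self.repeat_prob(state_key)
        self.counts[state_key] += 1
        if self.last_action is not None and self.rng.random() < p:
            return self.last_action
        self.last_action = self.base_policy(state_key)
        return self.last_action


explorer = PersistentExplorer(base_policy=lambda s: 0, max_repeat_prob=0.8)
prob_before = explorer.repeat_prob("s0")
for _ in range(3):
    explorer.act("s0")
prob_after = explorer.repeat_prob("s0")
```

Because the repeat probability depends only on visit counts, this sketch needs no extra value function or policy for choosing the repetition length, which is the property the abstract emphasizes.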

* Under review

Revisiting Data Augmentation in Deep Reinforcement Learning

Feb 19, 2024

Jianshu Hu, Yunpeng Jiang, Paul Weng

Abstract:Various data augmentation techniques have been recently proposed in image-based deep reinforcement learning (DRL). Although they empirically demonstrate the effectiveness of data augmentation for improving sample efficiency or generalization, which technique should be preferred is not always clear. To tackle this question, we analyze existing methods to better understand them and to uncover how they are connected. Notably, by expressing the variance of the Q-targets and that of the empirical actor/critic losses of these methods, we can analyze the effects of their different components and compare them. We furthermore formulate an explanation about how these methods may be affected by choosing different data augmentation transformations in calculating the target Q-values. This analysis suggests recommendations on how to exploit data augmentation in a more principled way. In addition, we include a regularization term called tangent prop, previously proposed in computer vision, but whose adaptation to DRL is novel to the best of our knowledge. We evaluate our proposition and validate our analysis in several domains. Compared to different relevant baselines, we demonstrate that it achieves state-of-the-art performance in most environments and shows higher sample efficiency and better generalization ability in some complex environments.

* Accepted in ICLR 2024

INViT: A Generalizable Routing Problem Solver with Invariant Nested View Transformer

Feb 12, 2024

Han Fang, Zhihao Song, Paul Weng, Yutong Ban

Abstract:Recently, deep reinforcement learning has shown promising results for learning fast heuristics to solve routing problems. However, most of these solvers struggle to generalize to an unseen distribution or to distributions with different scales. To address this issue, we propose a novel architecture, called Invariant Nested View Transformer (INViT), which is designed to enforce a nested design together with invariant views inside the encoders to promote the generalizability of the learned solver. It applies a modified policy gradient algorithm enhanced with data augmentations. We demonstrate that the proposed INViT achieves a dominant generalization performance on both TSP and CVRP problems with various distributions and different problem scales.

A Survey of Reinforcement Learning from Human Feedback

Dec 22, 2023

Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

Abstract:Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The training of Large Language Models (LLMs) has impressively demonstrated this potential in recent years, where RLHF played a decisive role in targeting the model's capabilities toward human objectives. This article provides a comprehensive overview of the fundamentals of RLHF, exploring the intricate dynamics between machine agents and human input. While recent focus has been on RLHF for LLMs, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. We delve into the core principles that underpin RLHF, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field. By synthesizing the current landscape of RLHF research, this article aims to provide researchers as well as practitioners with a comprehensive understanding of this rapidly growing field of research.

Learning Rewards to Optimize Global Performance Metrics in Deep Reinforcement Learning

Mar 16, 2023

Junqi Qian, Paul Weng, Chenmien Tan

Abstract:When applying reinforcement learning (RL) to a new problem, reward engineering is a necessary, but often difficult and error-prone task a system designer has to face. To avoid this step, we propose LR4GPM, a novel (deep) RL method that can optimize a global performance metric, which is supposed to be available as part of the problem description. LR4GPM alternates between two phases: (1) learning a (possibly vector) reward function used to fit the performance metric, and (2) training a policy to optimize an approximation of this performance metric based on the learned rewards. Such RL training is not straightforward since both the reward function and the policy are trained using non-stationary data. To overcome this issue, we propose several training tricks. We demonstrate the efficiency of LR4GPM on several domains. Notably, LR4GPM outperforms the winner of a recent autonomous driving competition organized at DAI'2020.

Neuro-Symbolic Hierarchical Rule Induction

Dec 26, 2021

Claire Glanois, Xuening Feng, Zhaohui Jiang, Paul Weng, Matthieu Zimmer, Dong Li, Wulong Liu

Abstract:We propose an efficient interpretable neuro-symbolic model to solve Inductive Logic Programming (ILP) problems. In this model, which is built from a set of meta-rules organised in a hierarchical structure, first-order rules are invented by learning embeddings to match facts and body predicates of a meta-rule. To instantiate it, we specifically design an expressive set of generic meta-rules, and demonstrate they generate a consequent fragment of Horn clauses. During training, we inject a controlled Gumbel noise to avoid local optima and employ an interpretability-regularization term to further guide the convergence to interpretable rules. We empirically validate our model on various tasks (ILP, visual genome, reinforcement learning) against several state-of-the-art methods.

* 10 pages, figures and references
