Abstract: On-policy reinforcement learning (RL) has become a popular framework for solving sequential decision problems due to its computational efficiency and theoretical simplicity. Some on-policy methods guarantee that every policy update is constrained to a trust region relative to the prior policy, which ensures training stability. However, these methods often require computationally intensive non-linear optimization or a particular form of action distribution. In this work, we show that applying KL penalization alone is nearly sufficient to enforce such trust regions. We then show that introducing a "fixup" phase is sufficient to guarantee that a trust region is enforced on every policy update, while adding fewer than 5% additional gradient steps in practice. The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces, is easy to implement, and produces results competitive with other trust region methods.
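To make the mechanism concrete, below is a minimal sketch of what such a fixup phase might look like; it is not the paper's exact procedure. It assumes a PyTorch policy exposing a hypothetical `policy.distribution(states)` method that returns a `torch.distributions` object for a batch of states, and the names `kl_limit`, `beta`, and `fixup_phase` are illustrative.

```python
import torch

def fixup_phase(policy, old_policy, states, kl_limit=0.05, beta=3.0,
                lr=1e-4, max_steps=100):
    """Run extra gradient steps that minimize only the KL penalty until
    the mean KL to the previous policy is back inside the trust region.

    Hypothetical interface: `policy.distribution(states)` returns a
    torch.distributions object for a batch of states.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    with torch.no_grad():
        dist_old = old_policy.distribution(states)
    for _ in range(max_steps):
        dist_new = policy.distribution(states)
        kl = torch.distributions.kl_divergence(dist_old, dist_new).mean()
        if kl <= kl_limit:
            break  # trust region satisfied; resume normal policy updates
        optimizer.zero_grad()
        (beta * kl).backward()  # penalize only the KL term
        optimizer.step()
```

Because these extra steps touch only the KL term, they are cheap relative to a full policy-gradient update, which is consistent with the reported overhead of fewer than 5% additional gradient steps.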
Abstract: In order to be effective general-purpose machines in real-world environments, robots will not only need to adapt their existing manipulation skills to new circumstances, they will also need to acquire entirely new skills on the fly. A great promise of continual learning is to endow robots with this ability by drawing on their accumulated knowledge and experience from prior skills. We take a fresh look at this problem by considering a setting in which the robot is limited to storing that knowledge and experience only in the form of learned skill policies. We show that storing skill policies, careful pre-training, and appropriately choosing when to transfer those skill policies is sufficient to build a continual learner in the context of robotic manipulation. We analyze which conditions are needed to transfer skills in the challenging Meta-World simulation benchmark. Using this analysis, we introduce a pair-wise metric relating skills that allows us to predict the effectiveness of skill transfer between tasks, and we use it to reduce the problem of continual learning to curriculum selection. Given an appropriate curriculum, we show how to continually acquire robotic manipulation skills without forgetting, using far fewer samples than needed to train them from scratch.
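As a rough illustration of how a pair-wise transfer metric reduces continual learning to curriculum selection, the sketch below greedily orders tasks given a matrix of predicted transfer scores. The matrix `transfer` and the greedy rule are assumptions for illustration; the actual metric and selection procedure are defined in the corresponding chapter.

```python
import numpy as np

def greedy_curriculum(transfer: np.ndarray, start: int = 0) -> list[int]:
    """Order tasks so each new task has the best predicted transfer
    from some already-acquired skill.

    `transfer[i, j]` is a (hypothetical) score predicting how well the
    skill for task i transfers to task j; higher is better.
    """
    n = transfer.shape[0]
    curriculum, remaining = [start], set(range(n)) - {start}
    while remaining:
        # Pick the unlearned task best served by any learned skill.
        nxt = max(remaining,
                  key=lambda j: max(transfer[i, j] for i in curriculum))
        curriculum.append(nxt)
        remaining.remove(nxt)
    return curriculum
```

Under this framing, once the pair-wise scores are estimated, choosing what to learn next is an ordering problem rather than a representation-learning problem.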
Abstract: We explore possible methods for multi-task transfer learning that seek to exploit the shared physical structure of robotics tasks. Specifically, we train policies for a base set of pre-training tasks, then experiment with adapting to new off-distribution tasks, using simple architectural approaches for re-using these policies as black-box priors. These approaches include learning an alignment of either the observation space or the action space from a base task to a target task to exploit rigid-body structure, and learning a time-domain switching policy across base tasks that solves the target task, to exploit temporal coherence. We find that combining low-complexity target policy classes, base policies as black-box priors, and simple optimization algorithms allows us to acquire new tasks outside the base task distribution using small amounts of offline training data.
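To illustrate the observation-space alignment idea, here is a minimal sketch that fits a linear map so a frozen base policy imitates offline target-task data, treating the base policy as a non-differentiable black box and using simple random search as the optimizer. The `base_policy(batch_of_obs) -> batch_of_actions` interface, the square map `W`, and the random-search loop are all assumptions for illustration, not the thesis's exact method.

```python
import numpy as np

def fit_obs_alignment(base_policy, obs, actions, iters=2000, sigma=0.1,
                      seed=0):
    """Fit a linear map W so base_policy(obs @ W.T) imitates offline
    target-task actions, treating base_policy as a black box.

    Hypothetical interface: `base_policy(batch_of_obs)` returns a
    batch of actions. Uses simple random search, since the black-box
    prior provides no gradients.
    """
    rng = np.random.default_rng(seed)
    dim = obs.shape[1]
    W = np.eye(dim)  # assumes base and target observations share a dimension

    def loss(W):
        return np.mean((base_policy(obs @ W.T) - actions) ** 2)

    best = loss(W)
    for _ in range(iters):
        cand = W + sigma * rng.standard_normal(W.shape)
        c = loss(cand)
        if c < best:
            W, best = cand, c
    return W
```

Because only the low-dimensional map is optimized while the base policy stays fixed, this kind of adaptation can plausibly succeed with small amounts of offline data, which is the regime the abstract describes.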