Jesus Bujalance Martin

Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks

Jan 11, 2022

Jesus Bujalance Martin, Fabien Moutarde

Abstract: During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is to learn from expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on manipulation robotics, specifically on three tasks for a 6-degree-of-freedom robotic arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithms (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, integrating two improvements from previous work into our method allows our approach to outperform all baselines.

* arXiv admin note: substantial text overlap with arXiv:2110.14464
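
The two relabelling steps described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Transition` fields, the `RelabellingBuffer` class, the additive form of the bonus, and the `success` flag are all assumptions made for the example, and the abstract does not specify a bonus value.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Transition:
    # Hypothetical transition layout; a real replay buffer may store more fields.
    state: Tuple[float, ...]
    action: Tuple[float, ...]
    reward: float
    next_state: Tuple[float, ...]
    done: bool


def relabel(episode: List[Transition], bonus: float) -> List[Transition]:
    """Return a copy of the episode with the bonus added to every reward.

    Adding the bonus to the environment reward is an assumption of this sketch.
    """
    return [replace(t, reward=t.reward + bonus) for t in episode]


class RelabellingBuffer:
    """Replay buffer that applies the reward bonus when episodes are stored."""

    def __init__(self, bonus: float) -> None:
        self.bonus = bonus
        self.storage: List[Transition] = []

    def add_demonstration(self, episode: List[Transition]) -> None:
        # Step 1: demonstration transitions always receive the bonus.
        self.storage.extend(relabel(episode, self.bonus))

    def add_online_episode(self, episode: List[Transition], success: bool) -> None:
        # Step 2: online episodes receive the same bonus only if they succeeded;
        # failed episodes are stored with their original rewards.
        if success:
            episode = relabel(episode, self.bonus)
        self.storage.extend(episode)
```

Because relabelling happens only when episodes are inserted, any off-policy learner that samples from `storage` can be used unchanged, which matches the abstract's claim that the method works with any off-policy algorithm.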

Learning from demonstrations with SACR2: Soft Actor-Critic with Reward Relabeling

Oct 27, 2021

Jesus Bujalance Martin, Raphaël Chekroun, Fabien Moutarde

Abstract: During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. However, a well-known caveat of DRL algorithms is their inefficiency, requiring huge amounts of data to converge. Off-policy algorithms tend to be more sample-efficient, and can additionally benefit from any off-policy data stored in the replay buffer. Expert demonstrations are a popular source of such data: the agent is exposed to successful states and actions early on, which can accelerate the learning process and improve performance. In the past, multiple ideas have been proposed to make good use of the demonstrations in the buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We carry out a study to evaluate several of these ideas in isolation, to see which of them have the most significant impact. We also present a new method, based on a reward bonus given to demonstrations and successful episodes. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. The base algorithm for our experiments is the popular Soft Actor-Critic (SAC), a state-of-the-art off-policy algorithm for continuous action spaces. Our experiments focus on robotics, specifically on a reaching task for a robotic arm in simulation. We show that our method, SACR2, which is based on reward relabeling, improves the performance on this task, even in the absence of demonstrations.

* Presented at Deep RL Workshop, NeurIPS 2021

Towards a Sample Efficient Reinforcement Learning Pipeline for Vision Based Robotics

May 20, 2021

Maxence Mahe, Pierre Belamri, Jesus Bujalance Martin

Abstract: Deep Reinforcement Learning holds the promise of enabling autonomous robots to master large repertoires of behavioural skills with minimal human intervention. The improvements brought by this technique enable robots to perform difficult tasks such as grasping or reaching targets. Nevertheless, the training process is still time-consuming and tedious, especially when learning policies from RGB camera information alone. This way of learning is essential for transferring the task from simulation to the real world, since the only external source of information for the robot in real life is video. In this paper, we study how to limit the time taken to train a robotic arm with 6 Degrees Of Freedom (DOF) to reach a ball from scratch, by assembling a pipeline that is as efficient as possible. The pipeline is divided into two parts: the first captures the relevant information from the RGB video with a Computer Vision algorithm. The second studies how to train a Deep Reinforcement Learning algorithm faster, so that the robotic arm reaches the target in front of it. Follow this link to find videos and plots in higher resolution: https://drive.google.com/drive/folders/1_lRlDSoPzd_GTcVrxNip10o_lm-_DPdn?usp=sharing

* 10 Pages, 15 Figures, 1 Table
