Intrinsic motivation (IM) and reward shaping are common methods for guiding the exploration of reinforcement learning (RL) agents by adding pseudo-rewards. Designing these rewards is challenging, however, and they can counter-intuitively harm performance. To address this, we characterize them as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formalize the value of exploration by casting the RL process as updating a prior over possible MDPs through experience. RL algorithms can be viewed as BAMDP policies; rather than attempting to find optimal algorithms by solving BAMDPs directly, we use the BAMDP as a theoretical framework for understanding how pseudo-rewards guide suboptimal algorithms. By decomposing BAMDP state value into the value of the information collected plus the prior value of the physical state, we show how pseudo-rewards can help by compensating for RL algorithms' misestimation of these two terms, yielding a new typology of IM and reward shaping approaches. We carefully extend the potential-based shaping theorem to BAMDPs to prove that when pseudo-rewards are BAMDP Potential-based shaping Functions (BAMPFs), they preserve the optimal, or approximately optimal, behavior of RL algorithms; otherwise, they can corrupt even optimal learners. Finally, we give guidance on how to design or convert existing pseudo-rewards to BAMPFs by expressing assumptions about the environment as potential functions on BAMDP states.
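
As a minimal sketch of the BAMPF condition (the notation here is illustrative and not fixed by the abstract), one can read it as the classic potential-based shaping form lifted to BAMDP hyperstates, i.e., a physical state $s$ paired with the experience history $h$ that determines the current posterior over MDPs:
\[
F\bigl((s,h),\,a,\,(s',h')\bigr) \;=\; \gamma\,\Phi(s',h') \;-\; \Phi(s,h),
\]
where $\Phi$ is a potential function on hyperstates and $\gamma$ is the discount factor. Under this reading, choosing $\Phi(s,h)$ to reflect an estimate of the value of the information carried by $h$ plus the prior value of $s$ is one way to express the decomposition described above as a potential function.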