Abstract: What drives exploration? Understanding intrinsic motivation is a long-standing challenge in both cognitive science and artificial intelligence; numerous objectives have been proposed and used to train agents, yet there remains a gap between human and agent exploration. We directly compare adults, children, and AI agents in a complex open-ended environment, Crafter, and study how three common intrinsic objectives (Entropy, Information Gain, and Empowerment) relate to their behavior. We find that only Entropy and Empowerment are consistently positively correlated with human exploration progress, indicating that these objectives may better inform intrinsic reward design for agents. Furthermore, across agents and humans we observe that Entropy initially increases rapidly and then plateaus, while Empowerment increases continuously, suggesting that state diversity may provide more signal in early exploration, while advanced exploration should prioritize control. Finally, we find preliminary evidence that private speech utterances, and particularly goal verbalizations, may aid exploration in children.
Abstract: Intrinsic motivation (IM) and reward shaping are common methods for guiding the exploration of reinforcement learning (RL) agents by adding pseudo-rewards. Designing these rewards is challenging, however, and they can counter-intuitively harm performance. To address this, we characterize them as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formalize the value of exploration by formulating the RL process as updating a prior over possible MDPs through experience. RL algorithms can be viewed as BAMDP policies; instead of attempting to find optimal algorithms by solving BAMDPs directly, we use BAMDPs as a theoretical framework for understanding how pseudo-rewards guide suboptimal algorithms. By decomposing BAMDP state value into the value of the information collected plus the prior value of the physical state, we show how pseudo-rewards can help by compensating for RL algorithms' misestimation of these two terms, yielding a new typology of IM and reward shaping approaches. We carefully extend the potential-based shaping theorem to BAMDPs to prove that when pseudo-rewards are BAMDP Potential-based shaping Functions (BAMPFs), they preserve optimal, or approximately optimal, behavior of RL algorithms; otherwise, they can corrupt even optimal learners. We finally give guidance on how to design or convert existing pseudo-rewards to BAMPFs by expressing assumptions about the environment as potential functions on BAMDP states.
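As a rough illustration of the potential-based shaping idea referred to above, the following minimal sketch uses the classical form F(s, s') = gamma * Phi(s') - Phi(s). It is not the paper's implementation: the names `shaping_reward` and `discounted_shaping_return` are hypothetical, and the states here are plain Python values standing in, by assumption, for BAMDP states (for example, a physical state paired with the experience collected so far). The sketch shows why such pseudo-rewards are benign: their discounted sum along any trajectory telescopes to a term that depends only on the first and last states.

```python
from typing import Callable, Hashable, Sequence

# Hypothetical type alias: a potential maps a state to a real number.
Potential = Callable[[Hashable], float]


def shaping_reward(potential: Potential, state: Hashable,
                   next_state: Hashable, gamma: float) -> float:
    """Classical potential-based pseudo-reward: gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_state) - potential(state)


def discounted_shaping_return(potential: Potential,
                              states: Sequence[Hashable],
                              gamma: float) -> float:
    """Discounted sum of shaping rewards along a trajectory of states."""
    total = 0.0
    for t in range(len(states) - 1):
        total += gamma ** t * shaping_reward(
            potential, states[t], states[t + 1], gamma)
    return total


if __name__ == "__main__":
    # Illustrative potential and trajectory (assumptions, not from the paper).
    phi = lambda s: float(s) ** 2
    trajectory = [0, 1, 3, 2]
    gamma = 0.9

    total = discounted_shaping_return(phi, trajectory, gamma)
    # Telescoping: the sum equals gamma^T * Phi(s_T) - Phi(s_0).
    telescoped = gamma ** (len(trajectory) - 1) * phi(trajectory[-1]) - phi(trajectory[0])
    print(total, telescoped)
```

Because the intermediate potentials cancel, the shaping terms shift every trajectory's return by an amount fixed by its endpoints, which is the intuition behind the preservation result the abstract describes.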