Abstract: In reinforcement learning (RL), the temporal difference (TD) error is known to be related to the firing rate of dopamine neurons. It has been observed that dopamine neurons do not behave uniformly; rather, each responds to the TD error in an optimistic or pessimistic manner, which has been interpreted as a kind of distributional RL. To explain these biological data, a heuristic model has been designed with learning rates that are asymmetric for positive and negative TD errors. However, this heuristic model is not theoretically grounded, and it is unknown whether it can work as an RL algorithm. This paper therefore introduces a novel, theoretically grounded model of optimism and pessimism derived from control as inference. In combination with ensemble learning, a distributional value function serving as a critic is estimated from regularly introduced optimism and pessimism, and the actor's policy is improved based on its central value. The proposed algorithm, called DROP (distributional and regular optimism and pessimism), is compared on dynamic tasks. Although the heuristic model showed poor learning performance, DROP performed excellently on all tasks with high generality. These results suggest that DROP is a new model that can elicit the potential contributions of optimism and pessimism.
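
A minimal sketch of the heuristic asymmetric-learning-rate TD update described above, not the DROP algorithm itself: each unit in an ensemble scales positive and negative TD errors with different rates, yielding optimistic or pessimistic value estimates. The function name and the rate values are illustrative assumptions.

```python
import numpy as np

def asymmetric_td_update(values, reward, next_value, lr_pos, lr_neg, gamma=0.99):
    """One TD update per ensemble unit with asymmetric learning rates."""
    values = np.asarray(values, dtype=float)
    td_errors = reward + gamma * next_value - values     # per-unit TD errors
    lrs = np.where(td_errors > 0.0, lr_pos, lr_neg)      # asymmetric rates per unit
    return values + lrs * td_errors

# Example: three units, from pessimistic (small lr_pos) to optimistic (large lr_pos).
values = [0.0, 0.0, 0.0]
lr_pos = np.array([0.02, 0.10, 0.18])
lr_neg = np.array([0.18, 0.10, 0.02])
print(asymmetric_td_update(values, reward=1.0, next_value=0.0, lr_pos=lr_pos, lr_neg=lr_neg))
```
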
Abstract: The problem of uncertainty is a feature of real-world robotics problems, and any control framework must contend with it in order to succeed in real application tasks. Reinforcement learning is no different, and epistemic uncertainty arising from model uncertainty or misspecification is a challenge well captured by the sim-to-real gap. A simple solution to this issue is domain randomization (DR), which unfortunately can result in conservative agents. As a remedy to this conservativeness, universal policies that take additional information about the randomized domain have risen as an alternative solution, along with recurrent neural network-based controllers. Uncertainty-aware universal policies present a particularly compelling solution, able to account for system identification uncertainties during deployment. In this paper, we reveal that the challenge of efficiently optimizing uncertainty-aware policies can be fundamentally reframed as solving the convex coverage set (CCS) problem within a multi-objective reinforcement learning (MORL) context. By introducing a novel Markov decision process (MDP) framework in which each domain's performance is treated as an independent objective, we unify the training of uncertainty-aware policies with MORL approaches. This connection enables the application of MORL algorithms to DR, allowing for more efficient policy optimization. To illustrate this, we focus on the linear utility function, which aligns with the expectation in DR formulations, and propose a series of algorithms adapted from the MORL literature to solve the CCS, demonstrating their ability to enhance the performance of uncertainty-aware policies.
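
A small sketch of the linear utility view mentioned above, under the assumption that each randomized domain's return is one objective: a weight vector over domains scalarizes the return vector, and uniform weights recover the expectation used in standard DR. Names and numbers are illustrative.

```python
import numpy as np

def linear_utility(per_domain_returns, domain_weights):
    """Scalarize a vector of per-domain returns with a linear utility function."""
    returns = np.asarray(per_domain_returns, dtype=float)
    weights = np.asarray(domain_weights, dtype=float)
    weights = weights / weights.sum()     # normalize so the utility is an expectation
    return float(weights @ returns)

# Example: three randomized domains with uniform weights (standard DR objective).
print(linear_utility([10.0, 7.5, 12.0], [1.0, 1.0, 1.0]))
```
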
Abstract: Model-based reinforcement learning has attracted much attention due to its high sample efficiency and is expected to be applied to real-world robotic applications. In the real world, where unobservable disturbances can lead to unexpected situations, robot policies should be learned to improve not only control performance but also robustness. Adversarial learning is an effective way to improve robustness, but an excessive adversary increases the risk of malfunction and makes the control performance too conservative. Therefore, this study proposes a new adversarial learning framework that makes reinforcement learning moderately robust without being overly conservative. To this end, adversarial learning is first rederived via variational inference. In addition, light robustness, which allows robustness to be maximized within an acceptable performance degradation, is utilized as a constraint. As a result, the proposed framework, called LiRA, can automatically adjust the adversary level, balancing robustness and conservativeness. The expected behaviors of LiRA are confirmed in numerical simulations. In addition, LiRA succeeds in learning force-reactive gait control of a quadrupedal robot using only real-world data collected in less than two hours.
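
The following is only an illustrative sketch of a constraint-driven adjustment of the adversary level in the spirit of the light-robustness constraint, not LiRA's actual update rule: the adversary is strengthened while the performance degradation stays within an acceptable budget and weakened otherwise. All names, gains, and bounds are assumptions.

```python
def adjust_adversary_level(level, nominal_return, robust_return,
                           acceptable_degradation, step=0.01,
                           min_level=0.0, max_level=1.0):
    degradation = nominal_return - robust_return   # how much the adversary hurts performance
    if degradation <= acceptable_degradation:
        level += step                              # budget not used up: strengthen the adversary
    else:
        level -= step                              # constraint violated: weaken the adversary
    return min(max(level, min_level), max_level)

# Example: degradation budget of 5 return units.
level = 0.1
level = adjust_adversary_level(level, nominal_return=100.0, robust_return=97.0,
                               acceptable_degradation=5.0)
print(level)  # adversary level increased slightly
```
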
Abstract: Experience replay (ER) used in (deep) reinforcement learning is considered to be applicable only to off-policy algorithms. However, there have been some cases in which ER has been applied to on-policy algorithms, suggesting that off-policyness might be merely a sufficient, rather than necessary, condition for applying ER. This paper reconsiders stricter "experience replayable conditions" (ERC) and proposes a way of modifying existing algorithms to satisfy them. To this end, the instability of policy improvement is assumed to be the key to ERC. The instability factors are revealed from the viewpoint of metric learning to be i) repulsive forces from negative samples and ii) replays of inappropriate experiences, and the corresponding stabilization tricks are derived accordingly. As a result, it is confirmed through numerical simulations that the proposed stabilization tricks make ER applicable to an advantage actor-critic, an on-policy algorithm. Moreover, its learning performance becomes comparable to that of a soft actor-critic, a state-of-the-art off-policy algorithm.
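
As one plausible (assumed, not necessarily the paper's) realization of screening out the "inappropriate experiences" mentioned above, the sketch below keeps a stored transition only if the current policy still assigns it sufficient probability relative to the behavior policy that generated it; `current_logprob_fn` and the threshold are hypothetical.

```python
import numpy as np

def select_replayable(batch, current_logprob_fn, ratio_threshold=0.1):
    """Keep transitions whose current/behavior policy density ratio is not too small."""
    kept = []
    for transition in batch:
        ratio = np.exp(current_logprob_fn(transition["state"], transition["action"])
                       - transition["behavior_logprob"])
        if ratio >= ratio_threshold:   # still plausible under the current policy
            kept.append(transition)
    return kept

# Dummy usage with a constant current-policy log-probability.
batch = [{"state": 0, "action": 1, "behavior_logprob": -1.0}]
print(select_replayable(batch, lambda s, a: -0.5))
```
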
Abstract: Robot control using reinforcement learning has become popular, but the learning process is generally terminated partway through an episode for safety and time-saving reasons. This study addresses a problem with the most common exception handling that temporal-difference (TD) learning performs at such terminations. That is, by forcibly assuming a zero value after termination, unintended implicit underestimation or overestimation occurs, depending on the reward design in normal states. When an episode is terminated due to task failure, the failure may be overvalued by this unintended overestimation, and a wrong policy may be acquired. Although this problem can be avoided by careful reward design, reviewing the exception handling at termination is essential for the practical use of TD learning. This paper therefore proposes a method that intentionally underestimates the value after termination to avoid learning failures caused by the unintended overestimation. In addition, the degree of underestimation is adjusted according to the degree of stationarity at termination, thereby preventing excessive exploration due to the intentional underestimation. Simulations and real-robot experiments showed that the proposed method can stably obtain optimal policies for various tasks and reward designs. https://youtu.be/AxXr8uFOe7M
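
A sketch of the exception handling discussed above: the conventional TD target bootstraps a value of zero after termination, while intentional underestimation is illustrated here by substituting a pessimistic terminal value. The specific form (a fixed pessimistic constant times an underestimation gain) is an assumption for illustration, not the paper's exact rule.

```python
def td_target(reward, next_value, terminated, gamma=0.99,
              pessimistic_value=-1.0, underestimation_gain=1.0):
    if terminated:
        # Conventional handling would bootstrap 0.0 here, which implicitly over-
        # or underestimates depending on the sign of rewards in normal states.
        terminal_value = underestimation_gain * pessimistic_value
        return reward + gamma * terminal_value
    return reward + gamma * next_value

print(td_target(reward=0.0, next_value=5.0, terminated=True))   # intentionally underestimated target
print(td_target(reward=0.0, next_value=5.0, terminated=False))  # ordinary bootstrapped target
```
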
Abstract: Soft actor-critic (SAC) in reinforcement learning is expected to be one of the next-generation robot control schemes. Its ability to maximize policy entropy would make a robotic controller robust to noise and perturbation, which is useful for real-world robot applications. However, in the current implementation, the priority of maximizing the policy entropy is automatically tuned, and the tuning rule can be interpreted as an equality constraint that binds the policy entropy to a specified target value. The current SAC therefore no longer maximizes the policy entropy, contrary to expectation. To resolve this issue, this paper improves the SAC implementation with a slack variable for appropriately handling the inequality constraint on maximizing the policy entropy. In the MuJoCo and PyBullet simulators, the modified SAC achieved higher robustness and more stable learning than before while regularizing the norm of actions. In addition, a real-robot variable impedance task was demonstrated to show the applicability of the modified SAC to real-world robot control.
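
A sketch contrasting the two constraint treatments discussed above. The first rule mimics the standard SAC temperature update, which pushes the policy entropy toward its target from both sides like an equality constraint; the second is an illustrative one-sided treatment (an assumption, not the paper's slack-variable formulation) in which the temperature rises only while the entropy is below the target and otherwise decays.

```python
def equality_style_update(alpha, entropy, target_entropy, lr=1e-3):
    # Standard SAC-like rule: alpha shrinks when entropy exceeds the target and
    # grows when it falls below, effectively enforcing entropy == target.
    return max(alpha - lr * (entropy - target_entropy), 0.0)

def inequality_style_update(alpha, entropy, target_entropy, lr=1e-3, decay=1e-3):
    # One-sided rule: only the violation of entropy >= target raises alpha;
    # otherwise alpha decays, leaving entropy above the target unpenalized.
    violation = max(target_entropy - entropy, 0.0)
    return max(alpha + lr * violation - decay * alpha, 0.0)

print(equality_style_update(0.2, entropy=1.5, target_entropy=1.0))    # alpha shrinks
print(inequality_style_update(0.2, entropy=1.5, target_entropy=1.0))  # alpha only decays
```
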
Abstract: This paper introduces a novel method of adding intrinsic bonuses to a task-oriented reward function in order to efficiently facilitate exploration in reinforcement learning. While various bonuses have been designed to date, they are analogous to the depth-first and breadth-first search algorithms in graph theory. This paper therefore first designs two bonuses, one corresponding to each of these search strategies. Then, a heuristic gain scheduling is applied to the designed bonuses, inspired by iterative deepening search, which is known to inherit the advantages of the two search algorithms. The proposed method is expected to allow an agent to efficiently reach the best solution in deeper states by gradually exploring unknown states. In three locomotion tasks with dense rewards and three simple tasks with sparse rewards, it is shown that the two types of bonuses contribute to performance improvements on different tasks in a complementary manner. In addition, by combining them with the proposed gain scheduling, all tasks can be accomplished with high performance.
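
An illustrative gain schedule (the functional form is an assumption, not the paper's exact schedule): the total reward mixes a breadth-first-like bonus and a depth-first-like bonus, with gains shifted over training so that broad exploration dominates early and deeper exploration of promising regions dominates later.

```python
import math

def scheduled_reward(task_reward, breadth_bonus, depth_bonus, step, horizon=100_000):
    progress = min(step / horizon, 1.0)
    gain_breadth = math.cos(0.5 * math.pi * progress)   # fades out over training
    gain_depth = math.sin(0.5 * math.pi * progress)     # fades in over training
    return task_reward + gain_breadth * breadth_bonus + gain_depth * depth_bonus

print(scheduled_reward(1.0, breadth_bonus=0.3, depth_bonus=0.2, step=0))        # breadth-weighted
print(scheduled_reward(1.0, breadth_bonus=0.3, depth_bonus=0.2, step=100_000))  # depth-weighted
```
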
Abstract: Sampling-based model predictive control (MPC) can be applied to versatile robotic systems. However, real-time control with it remains a big challenge due to its unstable updates and poor convergence. This paper tackles this challenge with a novel derivation from the reverse Kullback-Leibler divergence, which has a mode-seeking behavior and is likely to find one of the sub-optimal solutions early. This derivation yields a weighted maximum likelihood estimation with positive/negative weights, which is solved by a mirror descent (MD) algorithm. While the negative weights eliminate unnecessary actions, they require a practical implementation, based on rejection sampling, that avoids interference between the positive and negative updates. In addition, although the convergence of MD can be accelerated with Nesterov's acceleration method, the method is modified for the proposed MPC with a heuristic step size that adapts to the noise estimated in the update amounts. In real-time simulations, the proposed method statistically solves more tasks than the conventional method and, thanks to the improved acceleration, accomplishes more complex tasks using only a CPU. In addition, its applicability is demonstrated in variable impedance control of a force-driven mobile robot. https://youtu.be/D8bFMzct1XM
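
A toy illustration of weighting sampled action sequences with both positive and negative weights, not the paper's mirror-descent derivation: samples better than a baseline cost receive positive weights, worse ones receive negative weights, and the sampling mean is nudged by the weighted deviations. The temperature, baseline choice, and step size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_update(mean, samples, costs, temperature=1.0, step_size=0.5):
    baseline = np.median(costs)
    weights = np.exp(-(costs - baseline) / temperature) - 1.0   # sign flips at the baseline cost
    weights /= np.sum(np.abs(weights)) + 1e-8                   # normalize by total absolute weight
    return mean + step_size * np.sum(weights[:, None] * (samples - mean), axis=0)

mean = np.zeros(3)                                  # e.g. a 3-step action sequence
samples = mean + 0.1 * rng.standard_normal((64, 3))
costs = np.sum((samples - 0.05) ** 2, axis=1)       # toy quadratic cost
print(weighted_update(mean, samples, costs))
```
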
Abstract: Extraction of a low-dimensional latent space from high-dimensional observation data is essential for constructing a real-time robot controller with a world model on the extracted latent space. However, there is no established method for automatically tuning the dimension size of the latent space, which makes it difficult to find the necessary and sufficient dimension size, i.e. the minimal realization of the world model. In this study, we analyze and improve the Tsallis-based variational autoencoder (q-VAE) and reveal that, under an appropriate configuration, it always facilitates making the latent space sparse. Even if the dimension size of the pre-specified latent space is redundant compared to the minimal realization, this sparsification collapses unnecessary dimensions, allowing them to be removed easily. We experimentally verified the benefits of this sparsification: the proposed method can easily find the necessary and sufficient six dimensions for a reaching task with a mobile manipulator that requires a six-dimensional state space. Moreover, by planning with such a minimal-realization world model learned in the extracted dimensions, the proposed method was able to exert a more optimal action sequence in real time, reducing the reaching accomplishment time by around 20 %. The attached video is uploaded on YouTube: https://youtu.be/-QjITrnxaRs
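
A sketch of the pruning step that sparsification enables, as a generic post-hoc check rather than the q-VAE objective itself: a latent dimension whose per-dimension KL divergence from the standard normal prior stays near zero carries no information and can be removed. The threshold is an assumption.

```python
import numpy as np

def active_latent_dims(mu, log_var, threshold=0.01):
    # Per-dimension KL(q(z|x) || N(0, 1)), averaged over a batch of encodings.
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    mean_kl = kl.mean(axis=0)
    return np.flatnonzero(mean_kl > threshold)

# Example: dimensions 1 and 3 have collapsed to the prior (mu ~ 0, var ~ 1).
mu = np.array([[0.8, 0.0, -0.5, 0.01], [1.1, 0.02, -0.4, -0.01]])
log_var = np.array([[-1.0, 0.0, -0.7, 0.0], [-1.2, 0.01, -0.6, 0.0]])
print(active_latent_dims(mu, log_var))   # -> [0 2]
```
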
Abstract: Deep reinforcement learning (DRL) is one of the promising approaches for introducing robots into complicated environments. The recent remarkable progress of DRL stands on the regularization of policies, which allows a policy to improve stably and efficiently. A popular method, so-called proximal policy optimization (PPO), and its variants constrain the density ratio of the latest and baseline policies when the density ratio exceeds a given threshold. This threshold can be designed relatively intuitively, and in fact a recommended value range has been suggested. However, the density ratio is asymmetric about its center, and the possible error scale from the center, which should be close to the threshold, depends on how the baseline policy is given. In order to maximize the value of policy regularization, this paper proposes a new PPO derived using the relative Pearson (RPE) divergence, hence called PPO-RPE, to design the threshold adaptively. In PPO-RPE, the relative density ratio, which can be formed symmetrically, replaces the raw density ratio. Thanks to this symmetry, the error scale from the center can easily be estimated, and hence the threshold can be adapted to the estimated error scale. Three simple benchmark simulations reveal the importance of algorithm-dependent threshold design. By additionally simulating four locomotion tasks, it is verified that the proposed method statistically contributes to task accomplishment by appropriately restricting policy updates.
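
A small sketch contrasting the raw density ratio used by PPO-style clipping with the relative density ratio from the density-ratio-estimation literature, which the abstract adopts: mixing the baseline policy into the denominator bounds the ratio by 1/beta, so its scale around the center is easier to estimate. The mixture parameter beta here is an assumed value, not one from the paper.

```python
import numpy as np

def raw_ratio(pi_new, pi_old):
    return pi_new / pi_old

def relative_ratio(pi_new, pi_old, beta=0.5):
    # Relative density ratio: the denominator mixes the two policies, so the
    # result stays in (0, 1/beta) regardless of how far the policies drift apart.
    return pi_new / (beta * pi_new + (1.0 - beta) * pi_old)

pi_new = np.array([0.9, 0.1, 0.5])
pi_old = np.array([0.1, 0.9, 0.5])
print(raw_ratio(pi_new, pi_old))        # unbounded above and asymmetric around 1
print(relative_ratio(pi_new, pi_old))   # bounded in (0, 1/beta) = (0, 2)
```
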