Abstract:This paper proposes an Online Control-Informed Learning (OCIL) framework, which synthesizes the well-established control theories to solve a broad class of learning and control tasks in real time. This novel integration effectively handles practical issues in machine learning such as noisy measurement data, online learning, and data efficiency. By considering any robot as a tunable optimal control system, we propose an online parameter estimator based on extended Kalman filter (EKF) to incrementally tune the system in real time, enabling it to complete designated learning or control tasks. The proposed method also improves robustness in learning by effectively managing noise in the data. Theoretical analysis is provided to demonstrate the convergence and regret of OCIL. Three learning modes of OCIL, i.e. Online Imitation Learning, Online System Identification, and Policy Tuning On-the-fly, are investigated via experiments, which validate their effectiveness.
Abstract:Zero-shot coordination (ZSC) is a significant challenge in multi-agent collaboration, aiming to develop agents that can coordinate with unseen partners they have not encountered before. Recent cutting-edge ZSC methods have primarily focused on two-player video games such as OverCooked!2 and Hanabi. In this paper, we extend the scope of ZSC research to the multi-drone cooperative pursuit scenario, exploring how to construct a drone agent capable of coordinating with multiple unseen partners to capture multiple evaders. We propose a novel Hypergraphic Open-ended Learning Algorithm (HOLA-Drone) that continuously adapts the learning objective based on our hypergraphic-form game modeling, aiming to improve cooperative abilities with multiple unknown drone teammates. To empirically verify the effectiveness of HOLA-Drone, we build two different unseen drone teammate pools to evaluate their performance in coordination with various unseen partners. The experimental results demonstrate that HOLA-Drone outperforms the baseline methods in coordination with unseen drone teammates. Furthermore, real-world experiments validate the feasibility of HOLA-Drone in physical systems. Videos can be found on the project homepage~\url{https://sites.google.com/view/hola-drone}.
Abstract:In this paper, we discuss the development and deployment of a robust autonomous system capable of performing various tasks in the maritime domain under unknown dynamic conditions. We investigate a data-driven approach based on modular design for ease of transfer of autonomy across different maritime surface vessel platforms. The data-driven approach alleviates issues related to a priori identification of system models that may become deficient under evolving system behaviors or shifting, unanticipated, environmental influences. Our proposed learning-based platform comprises a deep Koopman system model and a change point detector that provides guidance on domain shifts prompting relearning under severe exogenous and endogenous perturbations. Motion control of the autonomous system is achieved via an optimal controller design. The Koopman linearized model naturally lends itself to a linear-quadratic regulator (LQR) control design. We propose the C3D control architecture Cascade Control with Change Point Detection and Deep Koopman Learning. The framework is verified in station keeping task on an ASV in both simulation and real experiments. The approach achieved at least 13.9 percent improvement in mean distance error in all test cases compared to the methods that do not consider system changes.
Abstract:Multi-armed bandit algorithms provide solutions for sequential decision-making where learning takes place by interacting with the environment. In this work, we model a distributed optimization problem as a multi-agent kernelized multi-armed bandit problem with a heterogeneous reward setting. In this setup, the agents collaboratively aim to maximize a global objective function which is an average of local objective functions. The agents can access only bandit feedback (noisy reward) obtained from the associated unknown local function with a small norm in reproducing kernel Hilbert space (RKHS). We present a fully decentralized algorithm, Multi-agent IGP-UCB (MA-IGP-UCB), which achieves a sub-linear regret bound for popular classes for kernels while preserving privacy. It does not necessitate the agents to share their actions, rewards, or estimates of their local function. In the proposed approach, the agents sample their individual local functions in a way that benefits the whole network by utilizing a running consensus to estimate the upper confidence bound on the global function. Furthermore, we propose an extension, Multi-agent Delayed IGP-UCB (MAD-IGP-UCB) algorithm, which reduces the dependence of the regret bound on the number of agents in the network. It provides improved performance by utilizing a delay in the estimation update step at the cost of more communication.
Abstract:This study introduces an approach to obtain a neighboring extremal optimal control (NEOC) solution for a closed-loop optimal control problem, applicable to a wide array of nonlinear systems and not necessarily quadratic performance indices. The approach involves investigating the variation incurred in the functional form of a known closed-loop optimal control law due to small, known parameter variations in the system equations or the performance index. The NEOC solution can formally be obtained by solving a linear partial differential equation, akin to those encountered in the iterative solution of a nonlinear Hamilton-Jacobi equation. Motivated by numerical procedures for solving these latter equations, we also propose a numerical algorithm based on the Galerkin algorithm, leveraging the use of basis functions to solve the underlying Hamilton-Jacobi equation of the original optimal control problem. The proposed approach simplifies the NEOC problem by reducing it to the solution of a simple set of linear equations, thereby eliminating the need for a full re-solution of the adjusted optimal control problem. Furthermore, the variation to the optimal performance index can be obtained as a function of both the system state and small changes in parameters, allowing the determination of the adjustment to an optimal control law given a small adjustment of parameters in the system or the performance index. Moreover, in order to handle large known parameter perturbations, we propose a homotopic approach that breaks down the single calculation of NEOC into a finite set of multiple steps. Finally, the validity of the claims and theory is supported by theoretical analysis and numerical simulations.
Abstract:This paper proposes a policy learning algorithm based on the Koopman operator theory and policy gradient approach, which seeks to approximate an unknown dynamical system and search for optimal policy simultaneously, using the observations gathered through interaction with the environment. The proposed algorithm has two innovations: first, it introduces the so-called deep Koopman representation into the policy gradient to achieve a linear approximation of the unknown dynamical system, all with the purpose of improving data efficiency; second, the accumulated errors for long-term tasks induced by approximating system dynamics are avoided by applying Bellman's principle of optimality. Furthermore, a theoretical analysis is provided to prove the asymptotic convergence of the proposed algorithm and characterize the corresponding sampling complexity. These conclusions are also supported by simulations on several challenging benchmark environments.
Abstract:This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task. A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman's principle of optimality with the policy gradient approach to improve the convergence rate. This paper provides theoretical analysis which guarantees the convergence rate and sample complexity of $\mathcal{O}(1/T)$ and $\mathcal{O}(1/\epsilon)$, respectively, where $T$ denotes the number of iterations and $\epsilon$ denotes the accuracy of the resulting stationary policy. Furthermore, several challenging numerical simulations, including cartpole, lunar lander, and robot arm, are provided to show that APG obtains similar performance compared to existing deterministic policy gradient methods while utilizing much less data and converging at a faster rate.
Abstract:Solving a collision-aware multi-agent mission planning (task allocation and path finding) problem is challenging due to the requirement of real-time computational performance, scalability, and capability of handling static/dynamic obstacles and tasks in a cluttered environment. This paper proposes a distributed real-time (on the order of millisecond) algorithm DrMaMP, which partitions the entire unassigned task set into subsets via approximation and decomposes the original problem into several single-agent mission planning problems. This paper presents experiments with dynamic obstacles and tasks and conducts optimality and scalability comparisons with an existing method, where DrMaMP outperforms the existing method in both indices. Finally, this paper analyzes the computational burden of DrMaMP which is consistent with the observations from comparisons, and presents the optimality gap in small-size problems.
Abstract:This paper investigates the consistency problem of EKF-based cooperative localization (CL) from the perspective of Kalman decomposition, which decomposes the observable and unobservable states and allows treating them individually. The factors causing the dimension reduction of the unobservable subspace, termed error discrepancy items, are explicitly isolated and identified in the state propagation and measurement Jacobians for the first time. We prove that the error discrepancy items lead to the global orientation being erroneously observable, which in turn causes the state estimation to be inconsistent. A CL algorithm, called Kalman decomposition-based EKF (KD-EKF), is proposed to improve consistency. The key idea is to perform state estimation using the Kalman observable canonical form in the transformed coordinates. By annihilating the error discrepancy items, proper observability properties are guaranteed. More importantly, the modified state propagation and measurement Jacobians are exactly equivalent to linearizing the nonlinear CL system at current best state estimates. Consequently, the inconsistency caused by the erroneous dimension reduction of the unobservable subspace is completely eliminated. The KD-EKF CL algorithm has been extensively verified in both Monte Carlo simulations and real-world experiments and shown to achieve better performance than state-of-the-art algorithms in terms of accuracy and consistency.
Abstract:We propose a Safe Pontryagin Differentiable Programming (Safe PDP) methodology, which establishes a theoretical and algorithmic safe differentiable framework to solve a broad class of safety-critical learning and control tasks -- problems that require the guarantee of both immediate and long-term constraint satisfaction at any stage of the learning and control progress. In the spirit of interior-point methods, Safe PDP handles different types of state and input constraints by incorporating them into the cost and loss through barrier functions. We prove the following fundamental features of Safe PDP: first, both the constrained solution and its gradient in backward pass can be approximated by solving a more efficient unconstrained counterpart; second, the approximation for both the solution and its gradient can be controlled for arbitrary accuracy using a barrier parameter; and third, importantly, any intermediate results throughout the approximation and optimization are strictly respecting all constraints, thus guaranteeing safety throughout the entire learning and control process. We demonstrate the capabilities of Safe PDP in solving various safe learning and control tasks, including safe policy optimization, safe motion planning, and learning MPCs from demonstrations, on different challenging control systems such as 6-DoF maneuvering quadrotor and 6-DoF rocket powered landing.