Abstract:Multi-agent reinforcement learning is a challenging and active field of research due to the inherent nonstationary property and coupling between agents. A popular approach to modeling the multi-agent interactions underlying the multi-agent RL problem is the Markov Game. There is a special type of Markov Game, termed Markov Potential Game, which allows us to reduce the Markov Game to a single-objective optimal control problem where the objective function is a potential function. In this work, we prove that a multi-agent collaborative field coverage problem, which is found in many engineering applications, can be formulated as a Markov Potential Game, and we can learn a parameterized closed-loop Nash Equilibrium by solving an equivalent single-objective optimal control problem. As a result, our algorithm is 10x faster during training compared to a game-theoretic baseline and converges faster during policy execution.
Abstract:Many inference scaling methods work by breaking a problem into smaller steps (or groups of tokens), then sampling and choosing the best next step. However, these steps and their sizes are usually predetermined based on human intuition or domain knowledge. This paper introduces dynamic decomposition, a method that automatically and adaptively splits solution and reasoning traces into steps during inference. This approach improves computational efficiency by focusing more resources on difficult steps, breaking them down further and prioritizing their sampling. Experiments on coding and math benchmarks (APPS, MATH, and LiveCodeBench) show that dynamic decomposition performs better than static methods, which rely on fixed steps like token-level, sentence-level, or single-step decompositions. These results suggest that dynamic decomposition can enhance many inference scaling techniques.
Abstract:In this paper, we present a control design methodology, stability criteria, and performance bounds for autonomous helicopter aerial refueling. Autonomous aerial refueling is particularly difficult due to the aerodynamic interaction between the wake of the tanker, the contact-sensitive nature of the maneuver, and the uncertainty in drogue motion. Since the probe tip is located significantly away from the helicopter's center-of-gravity, its position (and velocity) is strongly sensitive to the helicopter's attitude (and angular rates). In addition, the fact that the helicopter is operating at high speeds to match the velocity of the tanker forces it to maintain a particular orientation, making the docking maneuver especially challenging. In this paper, we propose a novel outer-loop position controller that incorporates the probe position and velocity into the feedback loop. The position and velocity of the probe tip depend both on the position (velocity) and on the attitude (angular rates) of the aircraft. We derive analytical guarantees for docking performance in terms of the uncertainty of the drogue motion and the angular acceleration of the helicopter, using the ultimate boundedness property of the closed-loop error dynamics. Simulations are performed on a high-fidelity UH60 helicopter model with a high-fidelity drogue motion under wind effects to validate the proposed approach for realistic refueling scenarios. These high-fidelity simulations reveal that the proposed control methodology yields an improvement of 36% in the 2-norm docking error compared to the existing standard controller.
Abstract:In pursuit of reinforcement learning systems that could train in physical environments, we investigate multi-task approaches as a means to alleviate the need for massive data acquisition. In a tabular scenario where the Q-functions are collected across tasks, we model our learning problem as optimizing a higher order tensor structure. Recognizing that close-related tasks may require similar actions, our proposed method imposes a low-rank condition on this aggregated Q-tensor. The rationale behind this approach to multi-task learning is that the low-rank structure enforces the notion of similarity, without the need to explicitly prescribe which tasks are similar, but inferring this information from a reduced amount of data simultaneously with the stochastic optimization of the Q-tensor. The efficiency of our low-rank tensor approach to multi-task learning is demonstrated in two numerical experiments, first in a benchmark environment formed by a collection of inverted pendulums, and then into a practical scenario involving multiple wireless communication devices.
Abstract:In this work, we present an approach to minimizing the time necessary for the end-effector of a redundant robot manipulator to traverse a Cartesian path by optimizing the trajectory of its joints. Each joint has limits in the ranges of position, velocity and acceleration, the latter making jerks in joint space undesirable. The proposed approach takes this nonlinear optimization problem whose variables are path speed and joint trajectory and reformulates it into a bi-level problem. The lower-level formulation is a convex subproblem that considers a fixed joint trajectory and maximizes path speed while considering all joint velocity and acceleration constraints. Under particular conditions, this subproblem has a closed-form solution. Then, we solve a higher-level subproblem by leveraging the directional derivative of the lower-level value with respect to the joint trajectory parameters. In particular, we use this direction to implement a Primal-Dual method that considers the path accuracy and joint position constraints. We show the efficacy of our proposed approach with simulations and experimental results.
Abstract:Electricity grid's resiliency and climate change strongly impact one another due to an array of technical and policy-related decisions that impact both. This paper introduces a physics-informed machine learning-based framework to enhance grid's resiliency. Specifically, when encountering disruptive events, this paper designs remedial control actions to prevent blackouts. The proposed Physics-Guided Reinforcement Learning (PG-RL) framework determines effective real-time remedial line-switching actions, considering their impact on power balance, system security, and grid reliability. To identify an effective blackout mitigation policy, PG-RL leverages power-flow sensitivity factors to guide the RL exploration during agent training. Comprehensive evaluations using the Grid2Op platform demonstrate that incorporating physical signals into RL significantly improves resource utilization within electric grids and achieves better blackout mitigation policies - both of which are critical in addressing climate change.
Abstract:Pretrained foundation models have exhibited extraordinary in-context learning performance, allowing zero-shot generalization to new tasks not encountered during the pretraining. In the case of RL, in-context RL (ICRL) emerges when pretraining FMs on decision-making problems in an autoregressive-supervised manner. Nevertheless, current state-of-the-art ICRL algorithms, such as AD, DPT and DIT, impose stringent requirements on generating the pretraining dataset concerning the behavior (source) policies, context information, and action labels, etc. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all environments during the generation of the pretraining dataset. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be both prohibitively intractable and expensive. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that allows to generate a remarkable pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling the outstanding state-action pairs from the entire state and action spaces by using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during the pretraining. To the best of our knowledge, this is the first work that enables promising ICRL under (e.g., uniform) random policies and random contexts. We establish theoretical analyses regarding the performance guarantees of SAD. Moreover, our empirical results across multiple ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 180.86% in the offline evaluation and by 172.8% in the online evaluation.
Abstract:This work considers the problem of transfer learning in the context of reinforcement learning. Specifically, we consider training a policy in a reduced order system and deploying it in the full state system. The motivation for this training strategy is that running simulations in the full-state system may take excessive time if the dynamics are complex. While transfer learning alleviates the computational issue, the transfer guarantees depend on the discrepancy between the two systems. In this work, we consider a class of cascade dynamical systems, where the dynamics of a subset of the state-space influence the rest of the states but not vice-versa. The reinforcement learning policy learns in a model that ignores the dynamics of these states and treats them as commanded inputs. In the full-state system, these dynamics are handled using a classic controller (e.g., a PID). These systems have vast applications in the control literature and their structure allows us to provide transfer guarantees that depend on the stability of the inner loop controller. Numerical experiments on a quadrotor support the theoretical findings.
Abstract:Offline reinforcement learning (RL) learns effective policies from a static target dataset. Despite state-of-the-art (SOTA) offline RL algorithms being promising, they highly rely on the quality of the target dataset. The performance of SOTA algorithms can degrade in scenarios with limited samples in the target dataset, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. In this context, determining the optimal way to trade off the source and target datasets remains a critical challenge in offline RL. To the best of our knowledge, this paper proposes the first framework that theoretically and experimentally explores how the weight assigned to each dataset affects the performance of offline RL. We establish the performance bounds and convergence neighborhood of our framework, both of which depend on the selection of the weight. Furthermore, we identify the existence of an optimal weight for balancing the two datasets. All theoretical guarantees and optimal weight depend on the quality of the source dataset and the size of the target dataset. Our empirical results on the well-known Procgen Benchmark substantiate our theoretical contributions.
Abstract:We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish theoretical properties of our approach and derive an algorithm based on a specific instance of this approach. Our empirical results demonstrate the significant benefits of our approach.