Abstract:Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function, and can therefore be thought of as the sequential generalization of a Generative Adversarial Network (GAN). A prominent example of this framework is Generative Adversarial Imitation Learning (GAIL). However, in recent years, diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression, yet produce generations of a higher quality. In response, we investigate how to lift insights from diffusion modeling to the sequential setting. We propose diffusing states and performing score-matching along diffused states to measure the discrepancy between the expert's and learner's states. Thus, our approach only requires training score functions to predict noises via standard regression, making it significantly easier and more stable to train than adversarial methods. Theoretically, we prove first- and second-order instance-dependent bounds with linear scaling in the horizon, proving that our approach avoids the compounding errors that stymie offline approaches to imitation learning. Empirically, we show our approach outperforms GAN-style imitation learning baselines across various continuous control problems, including complex tasks like controlling humanoids to walk, sit, and crawl.
Abstract:We study a collaborative learning problem where $m$ agents estimate a vector $\mu\in\mathbb{R}^d$ by collecting samples from normal distributions, with each agent $i$ incurring a cost $c_{i,k} \in (0, \infty]$ to sample from the $k^{\text{th}}$ distribution $\mathcal{N}(\mu_k, \sigma^2)$. Instead of working on their own, agents can collect data that is cheap to them, and share it with others in exchange for data that is expensive or even inaccessible to them, thereby simultaneously reducing data collection costs and estimation error. However, when agents have different collection costs, we need to first decide how to fairly divide the work of data collection so as to benefit all agents. Moreover, in naive sharing protocols, strategic agents may under-collect and/or fabricate data, leading to socially undesirable outcomes. Our mechanism addresses these challenges by combining ideas from cooperative and non-cooperative game theory. We use ideas from axiomatic bargaining to divide the cost of data collection. Given such a solution, we develop a Nash incentive-compatible (NIC) mechanism to enforce truthful reporting. We achieve a $\mathcal{O}(\sqrt{m})$ approximation to the minimum social penalty (sum of agent estimation errors and data collection costs) in the worst case, and a $\mathcal{O}(1)$ approximation under favorable conditions. We complement this with a hardness result, showing that $\Omega(\sqrt{m})$ is unavoidable in any NIC mechanism.
Abstract:We study the game modification problem, where a benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game so that a target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range, in a way that minimizes the modification cost. We characterize the set of policy profiles that can be installed as the unique equilibrium of some game, and establish sufficient and necessary conditions for successful installation. We propose an efficient algorithm, which solves a convex optimization problem with linear constraints and then performs random perturbation, to obtain a modification plan with a near-optimal cost.
Abstract:As machine learning (ML) is more tightly woven into society, it is imperative that we better characterize ML's strengths and limitations if we are to employ it responsibly. Existing benchmark environments for ML, such as board and video games, offer well-defined benchmarks for progress, but constituent tasks are often complex, and it is frequently unclear how task characteristics contribute to overall difficulty for the machine learner. Likewise, without a systematic assessment of how task characteristics influence difficulty, it is challenging to draw meaningful connections between performance in different benchmark environments. We introduce a novel benchmark environment that offers an enormous range of ML challenges and enables precise examination of how task elements influence practical difficulty. The tool frames learning tasks as a "board-clearing game," which we call the Game of Hidden Rules (GOHR). The environment comprises an expressive rule language and a captive server environment that can be installed locally. We propose a set of benchmark rule-learning tasks and plan to support a performance leader-board for researchers interested in attempting to learn our rules. GOHR complements existing environments by allowing fine, controlled modifications to tasks, enabling experimenters to better understand how each facet of a given learning task contributes to its practical difficulty for an arbitrary ML algorithm.
Abstract:We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude and their fake data can be of any sizes. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve superior robustness guarantee than prior works.
Abstract:We study the adversarial robustness in offline reinforcement learning. Given a batch dataset consisting of tuples $(s, a, r, s')$, an adversary is allowed to arbitrarily modify $\epsilon$ fraction of the tuples. From the corrupted dataset the learner aims to robustly identify a near-optimal policy. We first show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable in linear MDP of dimension $d$, even if the adversary only corrupts the reward element in a tuple. This contrasts with dimension-free results in robust supervised learning and best-known lower-bound in the online RL setting with corruption. Next, we propose robust variants of the Least-Square Value Iteration (LSVI) algorithm utilizing robust supervised learning oracles, which achieve near-matching performances in cases both with and without full data coverage. The algorithm requires the knowledge of $\epsilon$ to design the pessimism bonus in the no-coverage case. Surprisingly, in this case, the knowledge of $\epsilon$ is necessary, as we show that being adaptive to unknown $\epsilon$ is impossible.This again contrasts with recent results on corruption-robust online RL and implies that robust offline RL is a strictly harder problem.
Abstract:We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions. Our attack model assumes an \textit{adaptive} adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most $\epsilon$-fraction of the learning episodes. Our attack model is strictly stronger than those considered in prior works. Our first result shows that no algorithm can find a better than $O(\epsilon)$-optimal policy under our attack model. Next, we show that surprisingly the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an $O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the first that can achieve a meaningful learning guarantee when a constant fraction of episodes are corrupted. Complimentary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.
Abstract:Stochastic Gradient Descent (SGD) plays a central role in modern machine learning. While there is extensive work on providing error upper bound for SGD, not much is known about SGD error lower bound. In this paper, we study the convergence of constant step-size SGD. We provide error lower bound of SGD for potentially non-convex objective functions with Lipschitz gradients. To our knowledge, this is the first analysis for SGD error lower bound without the strong convexity assumption. We use experiments to illustrate our theoretical results.
Abstract:We investigate optimal adversarial attacks against time series forecast made by autoregressive models. In our setting, the environment evolves according to a potentially nonlinear dynamical system. A linear autoregressive model observes the current environment state and predicts its future values. But an adversary can modify the environment state and hence indirectly manipulate the autoregressive model forecasts. The adversary wants to drive the forecasts towards some adversarial targets while minimizing environment modification. We pose this attack problem as optimal control. When the environment dynamics is linear, we provide a closed-form solution to the optimal attack using Linear Quadratic Regulator (LQR). Otherwise, we propose an approximate attack based on Model Predictive Control (MPC) and iterative LQR (iLQR). Our paper thus connects adversarial learning with control theory. We demonstrate the advantage of our methods empirically.