Abstract:When humans interact with learning-based control systems, a common goal is to minimize a cost function known only to the human. For instance, an exoskeleton may adapt its assistance in an effort to minimize the human's metabolic cost-of-transport. Conventional approaches to synthesizing the learning algorithm solve an inverse problem to infer the human's cost. However, these problems can be ill-posed, hard to solve, or sensitive to problem data. Here we show a game-theoretic learning algorithm that works solely by observing human actions to find the cost minimum, avoiding the need to solve an inverse problem. We evaluate the performance of our algorithm in an extensive set of human subjects experiments, demonstrating consistent convergence to the minimum of a prescribed human cost function in scalar and multidimensional instantiations of the game. We conclude by outlining future directions for theoretical and empirical extensions of our results.
Abstract:We study the repeated principal-agent bandit game, where the principal indirectly interacts with the unknown environment by proposing incentives for the agent to play arms. Most existing work assumes the agent has full knowledge of the reward means and always behaves greedily, but in many online marketplaces, the agent needs to learn the unknown environment and sometimes explore. Motivated by such settings, we model a self-interested learning agent with exploration behaviors who iteratively updates reward estimates and either selects an arm that maximizes the estimated reward plus incentive or explores arbitrarily with a certain probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both i.i.d. and linear reward settings with bandit feedback in a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}( T^{2/3} )$, respectively. Specifically, these algorithms are established upon a novel elimination framework coupled with newly-developed search algorithms which accommodate the uncertainty arising from the learning behavior of the agent. We then extend the framework to handle the exploratory learning agent and develop an algorithm to achieve a $\widetilde{O}(T^{2/3})$ regret bound in i.i.d. reward setup by enhancing the robustness of our elimination framework to the potential agent exploration. Finally, when reducing our agent behaviors to the one studied in (Dogan et al., 2023a), we propose an algorithm based on our robust framework, which achieves a $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{11/12})$ bound.
Abstract:As interactions between humans and AI become more prevalent, it is critical to have better predictors of human behavior in these interactions. We investigated how changes in the AI's adaptive algorithm impact behavior predictions in two-player continuous games. In our experiments, the AI adapted its actions using a gradient descent algorithm under different adaptation rates while human participants were provided cost feedback. The cost feedback was provided by one of two types of visual displays: (a) cost at the current joint action vector, or (b) cost in a local neighborhood of the current joint action vector. Our results demonstrate that AI adaptation rate can significantly affect human behavior, having the ability to shift the outcome between two game theoretic equilibrium. We observed that slow adaptation rates shift the outcome towards the Nash equilibrium, while fast rates shift the outcome towards the human-led Stackelberg equilibrium. The addition of localized cost information had the effect of shifting outcomes towards Nash, compared to the outcomes from cost information at only the current joint action vector. Future work will investigate other effects that influence the convergence of gradient descent games.
Abstract:This paper studies ML systems that interactively learn from users across multiple subpopulations with heterogeneous data distributions. The primary objective is to provide specialized services for different user groups while also predicting user preferences. Once the users select a service based on how well the service anticipated their preference, the services subsequently adapt and refine themselves based on the user data they accumulate, resulting in an iterative, alternating minimization process between users and services (learning dynamics). Employing such tailored approaches has two main challenges: (i) Unknown user preferences: Typically, data on user preferences are unavailable without interaction, and uniform data collection across a large and diverse user base can be prohibitively expensive. (ii) Suboptimal Local Solutions: The total loss (sum of loss functions across all users and all services) landscape is not convex even if the individual losses on a single service are convex, making it likely for the learning dynamics to get stuck in local minima. The final outcome of the aforementioned learning dynamics is thus strongly influenced by the initial set of services offered to users, and is not guaranteed to be close to the globally optimal outcome. In this work, we propose a randomized algorithm to adaptively select very few users to collect preference data from, while simultaneously initializing a set of services. We prove that under mild assumptions on the loss functions, the expected total loss achieved by the algorithm right after initialization is within a factor of the globally optimal total loss with complete user preference data, and this factor scales only logarithmically in the number of services. Our theory is complemented by experiments on real as well as semi-synthetic datasets.
Abstract:We study the sample complexity of identifying the pure strategy Nash equilibrium (PSNE) in a two-player zero-sum matrix game with noise. Formally, we are given a stochastic model where any learner can sample an entry $(i,j)$ of the input matrix $A\in[-1,1]^{n\times m}$ and observe $A_{i,j}+\eta$ where $\eta$ is a zero-mean 1-sub-Gaussian noise. The aim of the learner is to identify the PSNE of $A$, whenever it exists, with high probability while taking as few samples as possible. Zhou et al. (2017) presents an instance-dependent sample complexity lower bound that depends only on the entries in the row and column in which the PSNE lies. We design a near-optimal algorithm whose sample complexity matches the lower bound, up to log factors. The problem of identifying the PSNE also generalizes the problem of pure exploration in stochastic multi-armed bandits and dueling bandits, and our result matches the optimal bounds, up to log factors, in both the settings.
Abstract:This paper considers a variant of zero-sum matrix games where at each timestep the row player chooses row $i$, the column player chooses column $j$, and the row player receives a noisy reward with mean $A_{i,j}$. The objective of the row player is to accumulate as much reward as possible, even against an adversarial column player. If the row player uses the EXP3 strategy, an algorithm known for obtaining $\sqrt{T}$ regret against an arbitrary sequence of rewards, it is immediate that the row player also achieves $\sqrt{T}$ regret relative to the Nash equilibrium in this game setting. However, partly motivated by the fact that the EXP3 strategy is myopic to the structure of the game, O'Donoghue et al. (2021) proposed a UCB-style algorithm that leverages the game structure and demonstrated that this algorithm greatly outperforms EXP3 empirically. While they showed that this UCB-style algorithm achieved $\sqrt{T}$ regret, in this paper we ask if there exists an algorithm that provably achieves $\text{polylog}(T)$ regret against any adversary, analogous to results from stochastic bandits. We propose a novel algorithm that answers this question in the affirmative for the simple $2 \times 2$ setting, providing the first instance-dependent guarantees for games in the regret setting. Our algorithm overcomes two major hurdles: 1) obtaining logarithmic regret even though the Nash equilibrium is estimable only at a $1/\sqrt{T}$ rate, and 2) designing row-player strategies that guarantee that either the adversary provides information about the Nash equilibrium, or the row player incurs negative regret. Moreover, in the full information case we address the general $n \times m$ case where the first hurdle is still relevant. Finally, we show that EXP3 and the UCB-based algorithm necessarily cannot perform better than $\sqrt{T}$.
Abstract:Autocurricular training is an important sub-area of multi-agent reinforcement learning~(MARL) that allows multiple agents to learn emergent skills in an unsupervised co-evolving scheme. The robotics community has experimented autocurricular training with physically grounded problems, such as robust control and interactive manipulation tasks. However, the asymmetric nature of these tasks makes the generation of sophisticated policies challenging. Indeed, the asymmetry in the environment may implicitly or explicitly provide an advantage to a subset of agents which could, in turn, lead to a low-quality equilibrium. This paper proposes a novel game-theoretic algorithm, Stackelberg Multi-Agent Deep Deterministic Policy Gradient (ST-MADDPG), which formulates a two-player MARL problem as a Stackelberg game with one player as the `leader' and the other as the `follower' in a hierarchical interaction structure wherein the leader has an advantage. We first demonstrate that the leader's advantage from ST-MADDPG can be used to alleviate the inherent asymmetry in the environment. By exploiting the leader's advantage, ST-MADDPG improves the quality of a co-evolution process and results in more sophisticated and complex strategies that work well even against an unseen strong opponent.
Abstract:Adaptive machines have the potential to assist or interfere with human behavior in a range of contexts, from cognitive decision-making to physical device assistance. Therefore it is critical to understand how machine learning algorithms can influence human actions, particularly in situations where machine goals are misaligned with those of people. Since humans continually adapt to their environment using a combination of explicit and implicit strategies, when the environment contains an adaptive machine, the human and machine play a game. Game theory is an established framework for modeling interactions between two or more decision-makers that has been applied extensively in economic markets and machine algorithms. However, existing approaches make assumptions about, rather than empirically test, how adaptation by individual humans is affected by interaction with an adaptive machine. Here we tested learning algorithms for machines playing general-sum games with human subjects. Our algorithms enable the machine to select the outcome of the co-adaptive interaction from a constellation of game-theoretic equilibria in action and policy spaces. Importantly, the machine learning algorithms work directly from observations of human actions without solving an inverse problem to estimate the human's utility function as in prior work. Surprisingly, one algorithm can steer the human-machine interaction to the machine's optimum, effectively controlling the human's actions even while the human responds optimally to their perceived cost landscape. Our results show that game theory can be used to predict and design outcomes of co-adaptive interactions between intelligent humans and machines.
Abstract:We study the sample complexity of identifying an approximate equilibrium for two-player zero-sum $n\times 2$ matrix games. That is, in a sequence of repeated game plays, how many rounds must the two players play before reaching an approximate equilibrium (e.g., Nash)? We derive instance-dependent bounds that define an ordering over game matrices that captures the intuition that the dynamics of some games converge faster than others. Specifically, we consider a stochastic observation model such that when the two players choose actions $i$ and $j$, respectively, they both observe each other's played actions and a stochastic observation $X_{ij}$ such that $\mathbb E[ X_{ij}] = A_{ij}$. To our knowledge, our work is the first case of instance-dependent lower bounds on the number of rounds the players must play before reaching an approximate equilibrium in the sense that the number of rounds depends on the specific properties of the game matrix $A$ as well as the desired accuracy. We also prove a converse statement: there exist player strategies that achieve this lower bound.
Abstract:Prediction systems face exogenous and endogenous distribution shift -- the world constantly changes, and the predictions the system makes change the environment in which it operates. For example, a music recommender observes exogeneous changes in the user distribution as different communities have increased access to high speed internet. If users under the age of 18 enjoy their recommendations, the proportion of the user base comprised of those under 18 may endogeneously increase. Most of the study of endogenous shifts has focused on the single decision-maker setting, where there is one learner that users either choose to use or not. This paper studies participation dynamics between sub-populations and possibly many learners. We study the behavior of systems with \emph{risk-reducing} learners and sub-populations. A risk-reducing learner updates their decision upon observing a mixture distribution of the sub-populations $\mathcal{D}$ in such a way that it decreases the risk of the learner on that mixture. A risk reducing sub-population updates its apportionment amongst learners in a way which reduces its overall loss. Previous work on the single learner case shows that myopic risk minimization can result in high overall loss~\citep{perdomo2020performative, miller2021outside} and representation disparity~\citep{hashimoto2018fairness, zhang2019group}. Our work analyzes the outcomes of multiple myopic learners and market forces, often leading to better global loss and less representation disparity.