Abstract: Single-player contextual bandits are a well-studied problem in reinforcement learning with applications in fields such as advertising, healthcare, and finance. In light of recent work on \emph{information asymmetric} bandits \cite{chang2022online, chang2023online}, we propose a novel multiplayer information asymmetric contextual bandit framework in which there are multiple players, each with their own set of actions. At every round, the players observe the same context vectors and simultaneously take an action from their own action sets, giving rise to a joint action. Upon taking this joint action, however, the players are subject to information asymmetry in (1) actions and/or (2) rewards. We design a multiplayer variant of the classical single-player \texttt{LinUCB} algorithm of \cite{chu2011contextual} that achieves the optimal regret $O(\sqrt{T})$ when only one kind of asymmetry is present. We then propose a novel algorithm \texttt{ETC}, built on explore-then-commit principles, that achieves the same optimal regret when both types of asymmetry are present.
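For reference, the single-player \texttt{LinUCB} index of \cite{chu2011contextual} on which the multiplayer variant is built selects, at round $t$, the action maximizing
\[
x_{t,a}^\top \hat\theta_t + \alpha \sqrt{x_{t,a}^\top A_t^{-1} x_{t,a}}, \qquad A_t = I_d + \sum_{s<t} x_{s,a_s} x_{s,a_s}^\top, \qquad \hat\theta_t = A_t^{-1} \sum_{s<t} r_s\, x_{s,a_s},
\]
where $x_{t,a}$ is the context vector of action $a$ and $\alpha > 0$ is an exploration parameter; this notation is a standard sketch of the classical index, not the specific multiplayer construction.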
Abstract: We consider a cooperative multiplayer bandit learning problem in which the players may agree on a strategy beforehand but cannot communicate during the learning process. In this problem, each player simultaneously selects an action, and based on the joint action selected by all players, the team receives a reward. The actions of all players are commonly observed; however, each player receives a noisy version of the reward that cannot be shared with the other players. Since the players receive potentially different rewards, there is an asymmetry in the information they use to select their actions. In this paper, we provide an algorithm based on upper and lower confidence bounds that the players can use to select their optimal actions despite this asymmetry in reward information. We show that the algorithm achieves logarithmic $O(\frac{\log T}{\Delta_{\bm{a}}})$ (gap-dependent) regret as well as $O(\sqrt{T\log T})$ (gap-independent) regret, which is asymptotically optimal in $T$. We also show that it performs empirically better than the current state-of-the-art algorithm for this environment.
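As an illustrative sketch (the symbols here are assumed rather than taken from the algorithm's formal description): if $\hat\mu_{\bm{a}}(t)$ denotes a player's empirical mean reward for joint action $\bm{a}$ after $n_{\bm{a}}(t)$ plays, the standard confidence bounds take the form
\[
\mathrm{UCB}_{\bm{a}}(t) = \hat\mu_{\bm{a}}(t) + \sqrt{\frac{2\log t}{n_{\bm{a}}(t)}}, \qquad \mathrm{LCB}_{\bm{a}}(t) = \hat\mu_{\bm{a}}(t) - \sqrt{\frac{2\log t}{n_{\bm{a}}(t)}}.
\]
Because actions are commonly observed, every player can compute the same counts $n_{\bm{a}}(t)$; the bounds differ across players only through their private reward samples, which is precisely the asymmetry the algorithm must coordinate around.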