Abstract:The study of collaborative multi-agent bandits has attracted significant attention recently. In light of this, we initiate the study of a new collaborative setting, consisting of $N$ agents such that each agent is learning one of $M$ stochastic multi-armed bandits to minimize their group cumulative regret. We develop decentralized algorithms which facilitate collaboration between the agents under two scenarios. We characterize the performance of these algorithms by deriving the per agent cumulative regret and group regret upper bounds. We also prove lower bounds for the group regret in this setting, which demonstrates the near-optimal behavior of the proposed algorithms.
Abstract:We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other, and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. Through a combination of collaborative best subspace identification, and per-agent learning of an unknown vector in the corresponding low-dimensional subspace, we show that the per-agent regret is much smaller than the case when agents do not communicate. By collaborating to identify the subspace containing $\theta^*$, we show that each agent effectively solves an easier instance of the linear bandit (compared to the case of no collaboration), thus leading to the reduced per-agent regret. We finally complement these results through simulations.
Abstract:We consider a decentralized multi-agent Multi Armed Bandit (MAB) setup consisting of $N$ agents, solving the same MAB instance to minimize individual cumulative regret. In our model, agents collaborate by exchanging messages through pairwise gossip style communications on an arbitrary connected graph. We develop two novel algorithms, where each agent only plays from a subset of all the arms. Agents use the communication medium to recommend only arm-IDs (not samples), and thus update the set of arms from which they play. We establish that, if agents communicate $\Omega(\log(T))$ times through any connected pairwise gossip mechanism, then every agent's regret is a factor of order $N$ smaller compared to the case of no collaborations. Furthermore, we show that the communication constraints only have a second order effect on the regret of our algorithm. We then analyze this second order term of the regret to derive bounds on the regret-communication tradeoffs. Finally, we empirically evaluate our algorithm and conclude that the insights are fundamental and not artifacts of our bounds. We also show a lower bound which gives that the regret scaling obtained by our algorithm cannot be improved even in the absence of any communication constraints. Our results thus demonstrate that even a minimal level of collaboration among agents greatly reduces regret for all agents.