Abstract:We study a hinted heterogeneous multi-agent multi-armed bandits problem (HMA2B), where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the $M$ agents has a unique reward distribution over $K$ arms, and in $T$ rounds, they can observe the reward of the arm they pull only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimal necessary hints without pulling arms, achieving time-independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP-HCLA, which is an extension of HCLA, uses a central decision-maker for arm-pulling and hint queries, achieving $O(M^4K)$ regret with $O(MK\log T)$ adaptive hints. In decentralized setups, we propose two algorithms, HD-ETC and EBHD-ETC, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding $O(M^3K^2)$ regret with $O(M^3K\log T)$ hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.
Abstract:The classic cake-cutting problem provides a model for addressing the fair and efficient allocation of a divisible, heterogeneous resource among agents with distinct preferences. Focusing on a standard formulation of cake cutting, in which each agent must receive a contiguous piece of the cake in an offline setting, this work instead focuses on online allocating non-contiguous pieces of cake among agents and establishes algorithmic results for fairness measures. In this regard, we made use of classic adversarial multi-armed bandits to achieve sub-linear Fairness and Revenue Regret at the same time. Adversarial bandits are powerful tools to model the adversarial reinforcement learning environments, that provide strong upper-bounds for regret of learning with just observing one action's reward in each step by applying smart trade-off between exploration and exploitation. This work studies the power of the famous EXP_3 algorithm that is based on exponential wight{-}importance updating probability distribution through time horizon.