Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pratik Gajane

FT R and D

Adversarial Multi-dueling Bandits

Jun 18, 2024

Pratik Gajane

Abstract:We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm which is based on an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback that is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting which demonstrates that our proposed algorithm is near-optimal.

Via

Access Paper or Ask Questions

Investigating Gender Fairness in Machine Learning-driven Personalized Care for Chronic Pain

Feb 29, 2024

Pratik Gajane, Sean Newman, John D. Piette

Abstract:This study investigates gender fairness in personalized pain care recommendations using machine learning algorithms. Leveraging a contextual bandits framework, personalized recommendations are formulated and evaluated using LinUCB algorithm on a dataset comprising interactions with $164$ patients across $10$ sessions each. Results indicate that while adjustments to algorithm parameters influence the quality of pain care recommendations, this impact remains consistent across genders. However, when certain patient information, such as self-reported pain measurements, is absent, the quality of pain care recommendations for women is notably inferior to that for men.

Via

Access Paper or Ask Questions

Provably Efficient Exploration in Constrained Reinforcement Learning:Posterior Sampling Is All You Need

Sep 27, 2023

Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein

Abstract:We present a new algorithm based on posterior sampling for learning in constrained Markov decision processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms. Our main theoretical result is a Bayesian regret bound for each cost component of \tilde{O} (HS \sqrt{AT}) for any communicating CMDP with S states, A actions, and bound on the hitting time H. This regret bound matches the lower bound in order of time horizon T and is the best-known regret bound for communicating CMDPs in the infinite-horizon undiscounted setting. Empirical results show that, despite its simplicity, our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.

Via

Access Paper or Ask Questions

Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards

Mar 01, 2023

Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina Verbeeke, Pratik Gajane

Figure 1 for Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards

Figure 2 for Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards

Figure 3 for Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards

Figure 4 for Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards

Abstract:Decision-making problems of sequential nature, where decisions made in the past may have an impact on the future, are used to model many practically important applications. In some real-world applications, feedback about a decision is delayed and may arrive via partial rewards that are observed with different delays. Motivated by such scenarios, we propose a novel problem formulation called multi-armed bandits with generalized temporally-partitioned rewards. To formalize how feedback about a decision is partitioned across several time steps, we introduce $\beta$-spread property. We derive a lower bound on the performance of any uniformly efficient algorithm for the considered problem. Moreover, we provide an algorithm called TP-UCB-FR-G and prove an upper bound on its performance measure. In some scenarios, our upper bound improves upon the state of the art. We provide experimental results validating the proposed algorithm and our theoretical results.

Via

Access Paper or Ask Questions

Curiosity-driven Exploration in Sparse-reward Multi-agent Reinforcement Learning

Feb 21, 2023

Jiong Li, Pratik Gajane

Abstract:Sparsity of rewards while applying a deep reinforcement learning method negatively affects its sample-efficiency. A viable solution to deal with the sparsity of rewards is to learn via intrinsic motivation which advocates for adding an intrinsic reward to the reward function to encourage the agent to explore the environment and expand the sample space. Though intrinsic motivation methods are widely used to improve data-efficient learning in the reinforcement learning model, they also suffer from the so-called detachment problem. In this article, we discuss the limitations of intrinsic curiosity module in sparse-reward multi-agent reinforcement learning and propose a method called I-Go-Explore that combines the intrinsic curiosity module with the Go-Explore framework to alleviate the detachment problem.

Via

Access Paper or Ask Questions

Local Differential Privacy for Sequential Decision Making in a Changing Environment

Jan 02, 2023

Pratik Gajane

Abstract:We study the problem of preserving privacy while still providing high utility in sequential decision making scenarios in a changing environment. We consider abruptly changing environment: the environment remains constant during periods and it changes at unknown time instants. To formulate this problem, we propose a variant of multi-armed bandits called non-stationary stochastic corrupt bandits. We construct an algorithm called SW-KLUCB-CF and prove an upper bound on its utility using the performance measure of regret. The proven regret upper bound for SW-KLUCB-CF is near-optimal in the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes. Moreover, we present a provably optimal mechanism which can guarantee the desired level of local differential privacy while providing high utility.

* Accepted at AAAI Privacy Preserving Artificial Intelligence (PPAI), 2023. arXiv admin note: text overlap with arXiv:1708.05033

Via

Access Paper or Ask Questions

Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards

Nov 13, 2022

Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina Verbeeke, Pratik Gajane

Figure 1 for Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards

Abstract:We investigate the Multi-Armed Bandit problem with Temporally-Partitioned Rewards (TP-MAB) setting in this paper. In the TP-MAB setting, an agent will receive subsets of the reward over multiple rounds rather than the entire reward for the arm all at once. In this paper, we introduce a general formulation of how an arm's cumulative reward is distributed across several rounds, called Beta-spread property. Such a generalization is needed to be able to handle partitioned rewards in which the maximum reward per round is not distributed uniformly across rounds. We derive a lower bound on the TP-MAB problem under the assumption that Beta-spread holds. Moreover, we provide an algorithm TP-UCB-FR-G, which uses the Beta-spread property to improve the regret upper bound in some scenarios. By generalizing how the cumulative reward is distributed, this setting is applicable in a broader range of applications.

Via

Access Paper or Ask Questions

An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning

Sep 08, 2022

Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein

Figure 1 for An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning

Figure 2 for An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning

Figure 3 for An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning

Figure 4 for An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning

Abstract:We study a posterior sampling approach to efficient exploration in constrained reinforcement learning. Alternatively to existing algorithms, we propose two simple algorithms that are more efficient statistically, simpler to implement and computationally cheaper. The first algorithm is based on a linear formulation of CMDP, and the second algorithm leverages the saddle-point formulation of CMDP. Our empirical results demonstrate that, despite its simplicity, posterior sampling achieves state-of-the-art performance and, in some cases, significantly outperforms optimistic algorithms.

Via

Access Paper or Ask Questions

Survey on Fair Reinforcement Learning: Theory and Practice

May 20, 2022

Pratik Gajane, Akrati Saxena, Maryam Tavakol, George Fletcher, Mykola Pechenizkiy

Figure 1 for Survey on Fair Reinforcement Learning: Theory and Practice

Abstract:Fairness-aware learning aims at satisfying various fairness constraints in addition to the usual performance criteria via data-driven machine learning techniques. Most of the research in fairness-aware learning employs the setting of fair-supervised learning. However, many dynamic real-world applications can be better modeled using sequential decision-making problems and fair reinforcement learning provides a more suitable alternative for addressing these problems. In this article, we provide an extensive overview of fairness approaches that have been implemented via a reinforcement learning (RL) framework. We discuss various practical applications in which RL methods have been applied to achieve a fair solution with high accuracy. We further include various facets of the theory of fair reinforcement learning, organizing them into single-agent RL, multi-agent RL, long-term fairness via RL, and offline learning. Moreover, we highlight a few major issues to explore in order to advance the field of fair-RL, namely - i) correcting societal biases, ii) feasibility of group fairness or individual fairness, and iii) explainability in RL. Our work is beneficial for both researchers and practitioners as we discuss articles providing mathematical guarantees as well as articles with empirical studies on real-world problems.

Via

Access Paper or Ask Questions

The Impact of Batch Learning in Stochastic Linear Bandits

Feb 14, 2022

Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein

Figure 1 for The Impact of Batch Learning in Stochastic Linear Bandits

Figure 2 for The Impact of Batch Learning in Stochastic Linear Bandits

Abstract:We consider a special case of bandit problems, named batched bandits, in which an agent observes batches of responses over a certain time period. Unlike previous work, we consider a practically relevant batch-centric scenario of batch learning. That is to say, we provide a policy-agnostic regret analysis and demonstrate upper and lower bounds for the regret of a candidate policy. Our main theoretical results show that the impact of batch learning can be measured proportional to the regret of online behavior. Primarily, we study two settings of the problem: instance-independent and instance-dependent. While the upper bound is the same for both settings, the worst-case lower bound is more comprehensive in the former case and more accurate in the latter one. Also, we provide a more robust result for the 2-armed bandit problem as an important insight. Finally, we demonstrate the consistency of theoretical results by conducting empirical experiments and reflect on the optimal batch size choice.

* arXiv admin note: substantial text overlap with arXiv:2111.02071

Via

Access Paper or Ask Questions