Abhijit Mazumdar

Provably Safe Reinforcement Learning for Stochastic Reach-Avoid Problems with Entropy Regularization

Jan 15, 2026

Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu

Abstract: We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.

Provably Safe Reinforcement Learning using Entropy Regularizer

Jan 13, 2026

Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu

Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time

Mar 23, 2024

Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu

Abstract: In this paper, we present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint. Despite the necessary attention of the scientific community, considering stochastic stopping time, the problem of learning an optimal policy without violating safety constraints during the learning phase is yet to be addressed. To this end, we propose an algorithm based on linear programming that does not require a process model. We show that the learned policy is safe with high confidence. We also propose a method to compute a safe baseline policy, which is central in developing algorithms that do not violate the safety constraints. Finally, we provide simulation results to show the efficacy of the proposed algorithm. Further, we demonstrate that efficient exploration can be achieved by defining a subset of the state space called a proxy set.
