Abstract: In an offline reinforcement learning setting, the safe policy improvement (SPI) problem aims to improve upon the performance of the behavior policy from which the sample data has been generated. State-of-the-art approaches to SPI require a large number of samples to provide practical probabilistic guarantees on the improved policy's performance. We present a novel approach to the SPI problem that requires less data for such guarantees. Specifically, to prove the correctness of these guarantees, we devise implicit transformations of the data set and the underlying environment model that serve as theoretical foundations for deriving tighter improvement bounds for SPI. Our empirical evaluation on standard benchmarks, using the well-established SPI with baseline bootstrapping (SPIBB) algorithm, shows that our method indeed significantly reduces the sample complexity of SPIBB.
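As a concrete illustration of the baseline-bootstrapping idea behind SPIBB (the threshold `n_wedge`, the array shapes, and the greedy projection below are illustrative assumptions, not the authors' implementation), the following sketch keeps the behavior policy's probability mass on state-action pairs with too few samples and reassigns the remaining mass greedily with respect to the estimated Q-function:

```python
import numpy as np

def spibb_policy(q_hat, pi_b, counts, n_wedge):
    """One SPIBB-style improvement step (illustrative sketch).

    q_hat:   (S, A) action values estimated from the offline data
    pi_b:    (S, A) behavior policy probabilities
    counts:  (S, A) number of samples per state-action pair
    n_wedge: count threshold below which we bootstrap on pi_b
    """
    n_states, _ = q_hat.shape
    pi = np.zeros_like(pi_b)
    for s in range(n_states):
        uncertain = counts[s] < n_wedge           # poorly explored actions
        # keep the behavior policy's mass on uncertain actions
        pi[s, uncertain] = pi_b[s, uncertain]
        free_mass = 1.0 - pi[s, uncertain].sum()  # mass we may reassign
        certain = ~uncertain
        if certain.any():
            # put the remaining mass on the best well-estimated action
            best = np.flatnonzero(certain)[np.argmax(q_hat[s, certain])]
            pi[s, best] += free_mass
        # if every action is uncertain, pi[s] already equals pi_b[s]
    return pi
```

The threshold `n_wedge` governs the trade-off between safety and sample complexity; the tighter improvement bounds described in the abstract aim at retaining the same guarantee while needing fewer samples per state-action pair.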
Abstract: A central task in control theory, artificial intelligence, and formal methods is to synthesize reward-maximizing strategies for agents that operate in partially unknown environments. In environments modeled by gray-box Markov decision processes (MDPs), the impact of the agents' actions is known in terms of successor states but not the stochastics involved. In this paper, we devise a strategy synthesis algorithm for gray-box MDPs via reinforcement learning that uses interval MDPs as its internal model. To cope with the limited sampling access in reinforcement learning, we incorporate two novel concepts into our algorithm, focusing on rapid and successful learning rather than on stochastic guarantees and optimality: lower confidence bound exploration reinforces variants of already learned practical strategies, and action scoping reduces the learning action space to promising actions. We illustrate the benefits of our algorithm by means of a prototypical implementation applied to examples from the AI and formal methods communities.
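To make the two exploration concepts more concrete, here is a minimal Python sketch under assumed interfaces (the function names, the confidence radius, and the pruning margin are illustrative, not the paper's definitions): lower-confidence-bound selection favors actions whose pessimistic value estimate is already high, and action scoping prunes actions that no longer look promising once enough samples are available.

```python
import numpy as np

def lcb_action(q_est, counts, total_visits, beta=1.0):
    """Pick an action by a lower confidence bound (sketch).

    Choosing the action with the highest pessimistic estimate reinforces
    strategies that have already proven themselves, instead of exploring
    optimistically.
    """
    bonus = beta * np.sqrt(np.log(max(total_visits, 2)) / np.maximum(counts, 1))
    lcb = q_est - bonus
    return int(np.argmax(lcb))

def scope_actions(q_est, counts, margin=0.1, min_samples=10):
    """Reduce the action space to promising actions (sketch).

    Once every action has a minimum number of samples, only actions whose
    estimate lies within `margin` of the best one stay in the scope.
    """
    if counts.min() < min_samples:
        return np.arange(len(q_est))   # not enough data to prune yet
    return np.flatnonzero(q_est >= q_est.max() - margin)
```

In this reading, the scoped action set would be recomputed per state as estimates improve, so that learning concentrates on the remaining promising actions.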
Abstract: Bayesian networks (BNs) are probabilistic graphical models widely used for representing expert knowledge and reasoning under uncertainty. Traditionally, they are based on directed acyclic graphs that capture dependencies between random variables. However, directed cycles can naturally arise when cross-dependencies between random variables exist, e.g., for modeling feedback loops. Existing methods to deal with such cross-dependencies usually rely on reductions to BNs without cycles. These approaches are difficult to generalize, since their justifications are intertwined with additional knowledge about the application context. In this paper, we present a foundational study of semantics for cyclic BNs that are generic and conservatively extend the cycle-free setting. First, we propose constraint-based semantics that specify requirements for full joint distributions over a BN to be consistent with the local conditional probabilities and independencies. Second, we introduce two kinds of limit semantics that formalize infinite unfolding approaches and show that they are computable by a Markov chain construction.
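The Markov chain construction behind the limit semantics can be pictured roughly as follows; the synchronous resampling rule and the power-iteration limit in this sketch are assumptions made for illustration and not necessarily the construction used in the paper. Each step of the chain resamples every (binary) variable from its conditional probability table given the parents' values in the current joint state, and the limit of the iterated distribution, when it exists, plays the role of the infinite unfolding.

```python
import itertools
import numpy as np

def unfolding_chain(parents, cpts, n_vars):
    """Transition matrix of an unfolding Markov chain over joint states (sketch).

    parents: dict  var -> tuple of parent variable indices (cycles allowed)
    cpts:    dict  var -> function(parent_values) -> P(var = 1 | parents)
    One step resamples every binary variable from its CPT, given the
    parents' values in the current joint state.
    """
    states = list(itertools.product([0, 1], repeat=n_vars))
    T = np.zeros((len(states), len(states)))
    for i, s in enumerate(states):
        for j, t in enumerate(states):
            p = 1.0
            for v in range(n_vars):
                pv = cpts[v](tuple(s[k] for k in parents[v]))
                p *= pv if t[v] == 1 else 1.0 - pv
            T[i, j] = p
    return T, states

def limit_distribution(T, init, steps=10_000):
    """Iterate the chain; if the distribution converges, return the limit."""
    dist = init.copy()
    for _ in range(steps):
        nxt = dist @ T
        if np.allclose(nxt, dist, atol=1e-12):
            break
        dist = nxt
    return dist
```

For instance, a two-variable feedback loop X → Y → X yields a chain over the four joint assignments of (X, Y), whose iterated distribution can then be inspected for convergence.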