Malte Schwarzkopf

Locally Constrained Policy Optimization for Online Reinforcement Learning in Non-Stationary Input-Driven Environments

Feb 04, 2023

Pouya Hamadanian, Arash Nasr-Esfahany, Siddartha Sen, Malte Schwarzkopf, Mohammad Alizadeh

Abstract: We study online Reinforcement Learning (RL) in non-stationary input-driven environments, where a time-varying exogenous input process affects the environment dynamics. Online RL is challenging in such environments due to catastrophic forgetting (CF): the agent tends to forget prior knowledge as it trains on new experiences. Prior approaches to mitigate this issue assume task labels (which are often not available in practice) or use off-policy methods that can suffer from instability and poor performance. We present Locally Constrained Policy Optimization (LCPO), an on-policy RL approach that combats CF by anchoring policy outputs on old experiences while optimizing the return on current experiences. To perform this anchoring, LCPO locally constrains policy optimization using samples from experiences that lie outside of the current input distribution. We evaluate LCPO in two gym and computer systems environments with a variety of synthetic and real input traces, and find that it outperforms state-of-the-art on-policy and off-policy RL methods in the online setting, while achieving results on par with an offline agent pre-trained on the whole input trace.
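The anchoring idea in this abstract can be illustrated with a toy sketch. This is not the LCPO algorithm itself: it swaps the paper's local constraint for a simple soft KL penalty, and the linear-softmax policy, the function names (`anchored_update`, `mean_kl`), the `penalty` weight, and the two hand-picked states are all assumptions made for illustration. It shows only that penalizing drift on old "anchor" states keeps the policy's outputs there close to their previous values while the policy still improves on current experience.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def policy(weights, states):
    # Linear-softmax policy: one row of action probabilities per state.
    return softmax(states @ weights.T)

def mean_kl(p_old, p_new):
    # Average KL(p_old || p_new) over states.
    return float(np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1)))

def anchored_update(weights, cur_states, cur_actions, cur_advantages,
                    anchor_states, anchor_probs, penalty=5.0, lr=0.1, steps=50):
    # Gradient ascent on: policy-gradient objective (current data)
    # minus penalty * KL(anchor_probs || policy) on old anchor states.
    weights = weights.copy()
    num_actions = weights.shape[0]
    for _ in range(steps):
        probs = policy(weights, cur_states)
        one_hot = np.eye(num_actions)[cur_actions]
        pg_grad = ((one_hot - probs) * cur_advantages[:, None]).T @ cur_states
        pg_grad /= len(cur_states)
        # d KL / d logits = new_probs - anchor_probs for a softmax policy.
        kl_grad = (policy(weights, anchor_states) - anchor_probs).T @ anchor_states
        kl_grad /= len(anchor_states)
        weights += lr * (pg_grad - penalty * kl_grad)
    return weights

# Hypothetical data: one current state and one old state with overlapping features.
cur_states = np.array([[1.0, 0.5]])
cur_actions = np.array([0])
cur_advantages = np.array([1.0])
anchor_states = np.array([[0.5, 1.0]])

w0 = np.zeros((2, 2))
anchor_probs = policy(w0, anchor_states)  # policy outputs to preserve

w_free = anchored_update(w0, cur_states, cur_actions, cur_advantages,
                         anchor_states, anchor_probs, penalty=0.0)
w_anchored = anchored_update(w0, cur_states, cur_actions, cur_advantages,
                             anchor_states, anchor_probs, penalty=5.0)

kl_free = mean_kl(anchor_probs, policy(w_free, anchor_states))
kl_anchored = mean_kl(anchor_probs, policy(w_anchored, anchor_states))
print(f"drift on old states without anchoring: {kl_free:.3f}")
print(f"drift on old states with anchoring:    {kl_anchored:.3f}")
```

In this toy setup the unanchored update shifts the policy on the old state noticeably, while the penalized update keeps it close to its earlier output. A hard constraint, as the abstract describes, would bound this drift rather than trade it off against return.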

* 8 pages + 4 pages in the appendix, 9 Figures and 5 Tables

Reinforcement Learning in Time-Varying Systems: an Empirical Study

Jan 14, 2022

Pouya Hamadanian, Malte Schwarzkopf, Siddartha Sen, Mohammad Alizadeh

Abstract: Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without the need for modeling the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e., it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity and develop a framework for addressing them to train RL agents in live systems. Such agents must explore and learn new environments, without hurting the system's performance, and remember them over time. To this end, our framework (1) identifies different environments encountered by the live system, (2) explores and trains a separate expert policy for each environment, and (3) employs safeguards to protect the system's performance. We apply our framework to two systems problems, straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that each component of our framework is necessary to cope with non-stationarity.

Learning Scheduling Algorithms for Data Processing Clusters

Oct 12, 2018

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh

Abstract: Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload structure, since developing and tuning a bespoke heuristic for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond specifying a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent new RL training methods for continuous job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima outperforms several heuristics, including hand-tuned ones, by at least 21%. Further experiments with an industrial production workload trace demonstrate that Decima delivers up to a 17% reduction in average job completion time and scales to large clusters.

Variance Reduction for Reinforcement Learning in Input-Driven Environments

Oct 03, 2018

Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh

Abstract: We consider reinforcement learning in input-driven environments, where an exogenous, stochastic input process affects the dynamics of the system. Input processes arise in many applications, including queuing systems, robotics control with disturbances, and object tracking. Since the state dynamics and rewards depend on the input process, the state alone provides limited information for the expected future returns. Therefore, policy gradient methods with standard state-dependent baselines suffer from high variance during training. We derive a bias-free, input-dependent baseline to reduce this variance, and analytically show its benefits over state-dependent baselines. We then propose a meta-learning approach to overcome the complexity of learning a baseline that depends on a long sequence of inputs. Our experimental results show that across environments from queuing systems, computer networks, and MuJoCo robotic locomotion, input-dependent baselines consistently improve training stability and result in better eventual policies.
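The variance argument in this abstract can be seen in a one-step toy problem. The sketch below is an illustration under stated assumptions, not the paper's method or environments: the reward model `reward = z + a`, the Bernoulli policy with probability `p`, and the function name `gradient_samples` are all hypothetical. The exogenous input `z` is drawn independently of the action, so subtracting a baseline that depends on `z` leaves the gradient estimate unbiased while removing the noise that `z` injects.

```python
import numpy as np

def gradient_samples(num_samples=20000, p=0.3, input_std=5.0, seed=0):
    """Per-sample policy-gradient estimates under two baselines.

    Toy setup (assumed): action a ~ Bernoulli(p), exogenous input
    z ~ Normal(0, input_std^2) independent of a, reward = z + a.
    The gradient is taken with respect to the policy's logit.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, input_std, size=num_samples)
    a = (rng.random(num_samples) < p).astype(float)
    reward = z + a

    # Score function of a Bernoulli policy w.r.t. its logit.
    score = a - p

    # Baseline that ignores the input: E[reward] = p.
    state_baseline = p
    # Baseline that conditions on the input: E[reward | z] = z + p.
    input_baseline = z + p

    g_state = score * (reward - state_baseline)
    g_input = score * (reward - input_baseline)
    return g_state, g_input

g_state, g_input = gradient_samples()
print("mean (state baseline):", g_state.mean())
print("mean (input baseline):", g_input.mean())
print("variance (state baseline):", g_state.var())
print("variance (input baseline):", g_input.var())
```

Both estimators target the same expected gradient, `p * (1 - p)`, but the input-dependent baseline cancels the contribution of `z` entirely in this toy case, so its variance is far lower. In sequential problems the baseline would have to depend on the whole future input sequence, which is the learning difficulty the abstract's meta-learning approach addresses.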
