James Gleeson

Optimizing Data Collection in Deep Reinforcement Learning

Jul 15, 2022

James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara, Gennady Pekhimenko

Abstract: Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU implementations of simulators induce high overhead when switching back and forth between GPU computations. We explore two optimizations that increase RL data collection efficiency by increasing GPU utilization: (1) GPU vectorization: parallelizing simulation on the GPU for increased hardware parallelism, and (2) simulator kernel fusion: fusing multiple simulation steps to run in a single GPU kernel launch to reduce global memory bandwidth requirements. We find that GPU vectorization can achieve up to $1024\times$ speedup over commonly used CPU simulators. We profile the performance of different implementations and show that for a simple simulator, ML compiler implementations (XLA) of GPU vectorization outperform a DNN framework (PyTorch) by $13.4\times$ by reducing CPU overhead from repeated Python to DL backend API calls. We show that simulator kernel fusion speedups with a simple simulator are $11.3\times$ and increase by up to $1024\times$ as simulator complexity increases in terms of memory bandwidth requirements. We show that the speedups from simulator kernel fusion are orthogonal and combinable with GPU vectorization, leading to a multiplicative speedup.

* MLBench 2022 ( https://memani1.github.io/mlbench22/ ) camera ready submission
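
The two optimizations named in the abstract above can be illustrated structurally with a toy, CPU-only sketch. It is not the authors' implementation and does not run on a GPU: the point-mass `step` function, the `LaunchCounter` class, and every other name here are hypothetical, and the counter merely stands in for GPU kernel launches. `step_batch` advances every environment in one call (an analogue of GPU vectorization), while `rollout_fused` runs all steps for every environment inside a single call (an analogue of simulator kernel fusion).

```python
from typing import List, Sequence, Tuple

# Hypothetical simulator state: (position, velocity) of a point mass.
State = Tuple[float, float]

DT = 0.1  # assumed integration time step


def step(state: State, action: float) -> State:
    """Advance one environment by one step; the action is an acceleration."""
    pos, vel = state
    vel = vel + action * DT
    pos = pos + vel * DT
    return (pos, vel)


class LaunchCounter:
    """Counts simulated 'kernel launches' (a stand-in, not a real GPU metric)."""

    def __init__(self) -> None:
        self.launches = 0


def step_batch(states: Sequence[State], actions: Sequence[float],
               counter: LaunchCounter) -> List[State]:
    """Vectorization analogue: one launch advances every environment one step."""
    counter.launches += 1
    return [step(s, a) for s, a in zip(states, actions)]


def rollout_unfused(states: Sequence[State],
                    action_seq: Sequence[Sequence[float]],
                    counter: LaunchCounter) -> List[State]:
    """One launch per step: states are handed back to the caller after each step."""
    current = list(states)
    for actions in action_seq:
        current = step_batch(current, actions, counter)
    return current


def rollout_fused(states: Sequence[State],
                  action_seq: Sequence[Sequence[float]],
                  counter: LaunchCounter) -> List[State]:
    """Fusion analogue: one launch runs every step, keeping each state local."""
    counter.launches += 1
    final = []
    for env_index, state in enumerate(states):
        for actions in action_seq:
            state = step(state, actions[env_index])
        final.append(state)
    return final
```

In this sketch both rollouts return identical states; only the number of simulated launches differs (one per step versus one in total). Any real speedup depends on GPU execution and memory traffic, which this toy does not model.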

RL-Scope: Cross-Stack Profiling for Deep Reinforcement Learning Workloads

Mar 04, 2021

James Gleeson, Srivatsan Krishnan, Moshe Gabel, Vijay Janapa Reddi, Eyal de Lara, Gennady Pekhimenko

Abstract: Deep reinforcement learning (RL) has made groundbreaking advancements in robotics, data center management and other applications. Unfortunately, system-level bottlenecks in RL workloads are poorly understood; we observe fundamental structural differences in RL workloads that make them inherently less GPU-bound than supervised learning (SL). To explain where training time is spent in RL workloads, we propose RL-Scope, a cross-stack profiler that scopes low-level CPU/GPU resource usage to high-level algorithmic operations, and provides accurate insights by correcting for profiling overhead. Using RL-Scope, we survey RL workloads across their major dimensions including ML backend, RL algorithm, and simulator. For ML backends, we explain a $2.3\times$ difference in runtime between equivalent PyTorch and TensorFlow algorithm implementations, and identify a bottleneck rooted in overly abstracted algorithm implementations. For RL algorithms and simulators, we show that on-policy algorithms are at least $3.5\times$ more simulation-bound than off-policy algorithms. Finally, we profile a scale-up workload and demonstrate that GPU utilization metrics reported by commonly used tools dramatically inflate GPU usage, whereas RL-Scope reports true GPU-bound time. RL-Scope is an open-source tool available at https://github.com/UofT-EcoSystem/rlscope .

* Proceedings of the 4th MLSys Conference, 2021. Changes: camera ready for MLSys publication -- shorten abstract, add acknowledgements, minor grammar fixes

Prediction with Restricted Resources and Finite Automata

Dec 10, 2008

Finn Macleod, James Gleeson

Abstract: We obtain an index of the complexity of a random sequence by allowing the role of the measure in classical probability theory to be played by a function we call the generating mechanism. Typically, this generating mechanism will be a finite automaton. We generate a set of biased sequences by applying a finite-state automaton with a specified number, $m$, of states to the set of all binary sequences. Thus we can index the complexity of our random sequence by the number of states of the automaton. We detail optimal algorithms to predict sequences generated in this way.

* 13 pages
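
As a rough illustration of the setup in the abstract above, the sketch below applies a small finite-state transducer to every binary input sequence of a given length and measures how biased the outputs are. The 2-state transition table, the function names, and the simple predictor are all assumptions made for this example. The predictor assumes the machine and its current state are known and that input bits are uniform, so it should not be read as the optimal algorithm from the paper.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Hypothetical Mealy machine with m = 2 states:
# (state, input_bit) -> (next_state, output_bit)
TRANSITIONS: Dict[Tuple[int, int], Tuple[int, int]] = {
    (0, 0): (0, 1),
    (0, 1): (1, 1),
    (1, 0): (0, 0),
    (1, 1): (1, 1),
}


def transduce(bits: Sequence[int], start: int = 0) -> List[int]:
    """Run the machine over an input sequence and collect the output bits."""
    state = start
    out = []
    for b in bits:
        state, o = TRANSITIONS[(state, b)]
        out.append(o)
    return out


def output_bias(n: int) -> float:
    """Fraction of 1s in the outputs over all 2**n input sequences of length n."""
    ones = 0
    total = 0
    for bits in product((0, 1), repeat=n):
        ones += sum(transduce(bits))
        total += n
    return ones / total


def predict_next(state: int) -> int:
    """Guess the next output bit from a known state, assuming uniform inputs.

    Ties are broken toward 1 (an arbitrary choice for this sketch).
    """
    ones = sum(TRANSITIONS[(state, b)][1] for b in (0, 1))
    return 1 if ones >= 1 else 0
```

Calling `output_bias(n)` for increasing `n` shows how this particular table skews the outputs toward 1, and swapping in a table with more states gives a machine whose outputs are harder to predict from a short history.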
