Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Evgeny Bolotin

GPU Domain Specialization via Composable On-Package Architecture

Apr 05, 2021

Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, David Nellans, Stephen W. Keckler

Figure 1 for GPU Domain Specialization via Composable On-Package Architecture

Figure 2 for GPU Domain Specialization via Composable On-Package Architecture

Figure 3 for GPU Domain Specialization via Composable On-Package Architecture

Figure 4 for GPU Domain Specialization via Composable On-Package Architecture

Abstract:As GPUs scale their low precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that converged GPU design trying to address diverging architectural requirements between FP32 (or larger) based HPC and FP16 (or smaller) based DL workloads results in sub-optimal configuration for either of the application domains. We argue that a Composable On-PAckage GPU (COPAGPU) architecture to provide domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16x larger cache capacity and 1.6x higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35% respectively and reduces the number of GPU instances by 50% in scale-out training scenarios.

Via

Access Paper or Ask Questions

The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Dec 08, 2020

Ahmet Inci, Evgeny Bolotin, Yaosheng Fu, Gal Dalal, Shie Mannor, David Nellans, Diana Marculescu

Figure 1 for The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Figure 2 for The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Figure 3 for The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Figure 4 for The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Abstract:With deep reinforcement learning (RL) methods achieving results that exceed human capabilities in games, robotics, and simulated environments, continued scaling of RL training is crucial to its deployment in solving complex real-world problems. However, improving the performance scalability and power efficiency of RL training through understanding the architectural implications of CPU-GPU systems remains an open problem. In this work we investigate and improve the performance and power efficiency of distributed RL training on CPU-GPU systems by approaching the problem not solely from the GPU microarchitecture perspective but following a holistic system-level analysis approach. We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework and empirically identify the bottlenecks caused by GPU microarchitectural, algorithmic, and system-level design choices. We show that the GPU microarchitecture itself is well-balanced for state-of-the-art RL frameworks, but further investigation reveals that the number of actors running the environment interactions and the amount of hardware resources available to them are the primary performance and power efficiency limiters. To this end, we introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources when designing scalable and efficient CPU-GPU systems for RL training.

* To appear in the proceedings of the 6th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) 2020

Via

Access Paper or Ask Questions