Derek Murray

Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers

Feb 02, 2022

Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung Kim

Abstract: Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade, leaving only those who have access to massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have access to only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training large DNN models can often exceed the aggregate capacity of all available GPUs on commodity servers; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training large models efficiently on modest multi-GPU deployments. Across many large DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.

SysML: The New Frontier of Machine Learning Systems

May 01, 2019

Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung(+59 more)

Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, SysML, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.

Dynamic Control Flow in Large-Scale Machine Learning

May 04, 2018

Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins(+5 more)

Abstract: Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.

* EuroSys 2018: Thirteenth EuroSys Conference, April 23-26, 2018, Porto, Portugal. ACM, New York, NY, USA
* Appeared in EuroSys 2018. 14 pages, 16 figures

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Mar 16, 2016

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin(+30 more)

Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

* Version 2 updates only the metadata, to correct the formatting of Martín Abadi's name
