Title: Machine Learning on Volatile Instances

Mar 12, 2020

Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, Carlee Joe-Wong

Abstract: Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple worker nodes. However, running distributed SGD can be prohibitively expensive because it may require specialized computing resources such as GPUs for extended periods of time. We propose cost-effective strategies to exploit volatile cloud instances, which are cheaper than standard instances but may be interrupted by higher-priority workloads. To the best of our knowledge, this work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affect SGD convergence and the time to train the model. By understanding the trade-offs among the preemption probability of the instances, accuracy, and training time, we are able to derive practical strategies for configuring distributed SGD jobs on volatile instances such as Amazon EC2 spot instances and other preemptible cloud instances. Experimental results show that our strategies achieve good training performance at substantially lower cost.
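To make the setting concrete, the following is a minimal toy sketch of synchronous distributed SGD in which the number of active workers varies from step to step. It is not the authors' method or any of their strategies. The function name `simulate_volatile_sgd`, the quadratic objective, the Gaussian gradient noise, and the assumption that each worker is preempted independently with a fixed probability at every step are all illustrative choices that do not come from the paper.

```python
import random


def simulate_volatile_sgd(num_workers=8, preempt_prob=0.3, steps=200,
                          lr=0.1, noise_std=1.0, seed=0):
    """Toy model: minimize f(w) = 0.5 * w**2 with noisy gradients.

    Assumption (not from the paper): at every step each worker is
    preempted independently with probability `preempt_prob`.
    """
    rng = random.Random(seed)
    w = 5.0  # arbitrary starting point
    active_counts = []
    for _ in range(steps):
        # Count the workers that survive preemption in this step.
        active = sum(1 for _ in range(num_workers)
                     if rng.random() >= preempt_prob)
        active_counts.append(active)
        if active == 0:
            # No gradients arrive, so the model is not updated.
            continue
        # Each active worker returns the true gradient (w) plus noise.
        grads = [w + rng.gauss(0.0, noise_std) for _ in range(active)]
        # Average over the active workers only, then take an SGD step.
        w -= lr * sum(grads) / active
    return w, active_counts


if __name__ == "__main__":
    for p in (0.0, 0.3, 0.7):
        final_w, counts = simulate_volatile_sgd(preempt_prob=p)
        mean_active = sum(counts) / len(counts)
        print(f"preempt_prob={p}: final w={final_w:.4f}, "
              f"mean active workers={mean_active:.2f}")
```

Running the script for several values of `preempt_prob` shows how fewer active workers means each update averages fewer gradients, so the iterate is noisier around the optimum at `w = 0`. Attaching a per-instance price to each step would turn the same loop into a rough cost comparison, again under these toy assumptions.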
