Muthian Sivathanu

Microsoft

Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

Feb 21, 2022

Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma(+16 more)

Abstract: Lowering costs by driving high utilization across deep learning workloads is a crucial lever for cloud providers. We present Singularity, Microsoft's globally distributed scheduling service for highly efficient and reliable execution of deep learning training and inference workloads. At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of AI accelerators (e.g., GPUs, FPGAs). All jobs in Singularity are preemptable, migratable, and dynamically resizable (elastic) by default: a live job can be dynamically and transparently (a) preempted and migrated to a different set of nodes, cluster, data center, or region and resumed exactly from the point where execution was preempted, and (b) resized (i.e., elastically scaled up or down) on a varying set of accelerators of a given type. Our mechanisms are transparent in that they do not require users to change their code or to use any custom libraries that may limit flexibility. Additionally, our approach significantly improves the reliability of deep learning workloads. We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on steady-state performance. Finally, our design approach is agnostic to DNN architectures and handles a variety of parallelism strategies (e.g., data/pipeline/model parallelism).

* Revision: Fixed some typos

LRTuner: A Learning Rate Tuner for Deep Neural Networks

May 30, 2021

Nikhil Iyer, V Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu

Abstract: One very important hyperparameter for training deep neural networks is the learning rate schedule of the optimizer. The choice of learning rate schedule determines the computational cost of getting close to a minimum, how close you actually get to that minimum, and, most importantly, the kind of local minimum (wide or narrow) attained. The kind of minimum attained has a significant impact on the generalization accuracy of the network. Current systems employ hand-tuned learning rate schedules, which are painstakingly tuned for each network and dataset. Given that the state space of schedules is huge, finding a satisfactory learning rate schedule can be very time-consuming. In this paper, we present LRTuner, a method for tuning the learning rate as training proceeds. Our method works with any optimizer, and we demonstrate results on SGD with Momentum and on Adam. We extensively evaluate LRTuner across multiple datasets, models, and optimizers. We compare favorably against standard learning rate schedules for the given datasets and models, including ImageNet on ResNet-50, CIFAR-10 on ResNet-18, and SQuAD fine-tuning on BERT. For example, on ImageNet with ResNet-50, LRTuner shows up to 0.2% absolute gains in test accuracy compared to the hand-tuned baseline schedule. Moreover, LRTuner can achieve the same accuracy as the baseline schedule in 29% fewer optimization steps.

* 17 pages

Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule

Mar 09, 2020

Nikhil Iyer, V Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu

Abstract: While the generalization properties of neural networks are not yet well understood, several papers argue that wide minima generalize better than narrow minima. In this paper, through detailed experiments, we not only corroborate the generalization properties of wide minima but also provide empirical evidence for a new hypothesis: that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-exploit learning rate schedule. On a variety of image and natural language datasets, compared to their original hand-tuned learning rate baselines, we show that our explore-exploit schedule can result in either up to 0.5% higher absolute accuracy using the original training budget or up to 44% reduced training time while achieving the original reported accuracy.
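The abstract does not spell out the shape of the explore-exploit schedule, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a two-phase form: a constant high learning rate during an "explore" phase, followed by a linear decay toward zero during an "exploit" phase. The function name `explore_exploit_lr` and the parameters `explore_steps` and `peak_lr` are hypothetical names chosen for this example.

```python
def explore_exploit_lr(step, total_steps, explore_steps, peak_lr):
    """Hypothetical two-phase learning rate schedule (an assumption,
    not the schedule specified in the paper).

    Explore phase: hold the learning rate at peak_lr.
    Exploit phase: decay linearly from peak_lr toward zero.
    """
    if not 0 <= explore_steps <= total_steps:
        raise ValueError("explore_steps must lie in [0, total_steps]")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")

    if step < explore_steps:
        # Explore: a large constant step size.
        return peak_lr

    # Exploit: fraction of the exploit phase that is still ahead.
    remaining = total_steps - step
    exploit_length = total_steps - explore_steps
    return peak_lr * remaining / exploit_length


if __name__ == "__main__":
    # Print the learning rate at a few points of a 100-step run.
    for s in (0, 39, 40, 70, 99):
        print(s, explore_exploit_lr(s, total_steps=100, explore_steps=40, peak_lr=0.1))
```

In this sketch the length of the explore phase is a free parameter; how long to explore, and how to decay afterward, are exactly the choices the paper's experiments would inform.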
