Noah Golmant

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

Nov 30, 2018

Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, Joseph Gonzalez

Abstract: Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique. We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for either train or test loss. This batch size is usually substantially below the capacity of current systems. We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break down depends more on attributes like model architecture and data complexity than it does directly on the size of the dataset.

Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions

Dec 03, 2017

Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, Kurt Keutzer

Abstract: Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free "shift" operation as an alternative to spatial convolutions. We fuse shifts and point-wise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the tradeoff between accuracy and efficiency. To demonstrate the operation's efficacy, we replace ResNet's 3x3 convolutions with shift-based modules for improved CIFAR10 and CIFAR100 accuracy using 60% fewer parameters; we additionally demonstrate the operation's resilience to parameter reduction on ImageNet, outperforming ResNet family members. We finally show the shift operation's applicability across domains, achieving strong performance with fewer parameters on classification, face verification and style transfer.

* Source code will be released afterwards
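
The idea in the Shift abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the nested-list tensor layout (channels x height x width), and the rule that assigns channel `c` the offset numbered `c` modulo `kernel_size**2` are all assumptions made for illustration. The sketch shows a shift that moves each channel by a fixed offset without multiplications or learned weights, a 1x1 (point-wise) convolution that mixes channels afterwards, and a weight count that makes the quadratic growth with kernel size concrete.

```python
from itertools import product


def shift_plane(plane, dy, dx):
    """Shift one H x W channel by (dy, dx); vacated cells are zero-filled."""
    h, w = len(plane), len(plane[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source cell that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = plane[sy][sx]
    return out


def shift(x, kernel_size=3):
    """Apply a fixed per-channel shift to a C x H x W nested list.

    Assumption (illustrative only): channel c uses offset number
    c % kernel_size**2 from the kernel_size x kernel_size neighborhood.
    The operation only moves data: no multiplications, no weights.
    """
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=2))
    return [shift_plane(plane, *offsets[c % len(offsets)])
            for c, plane in enumerate(x)]


def pointwise_conv(x, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(row[c] * x[c][i][j] for c in range(len(x)))
              for j in range(w)] for i in range(h)]
            for row in weights]


def conv_params(c_in, c_out, kernel_size):
    """Weight count of a standard spatial convolution (bias ignored)."""
    return kernel_size * kernel_size * c_in * c_out


def shift_module_params(c_in, c_out):
    """Weight count of a shift followed by one 1x1 convolution."""
    return c_in * c_out  # the shift itself contributes nothing
```

Under these assumptions, `conv_params(64, 64, 3)` is 36864 and `conv_params(64, 64, 5)` is 102400, while `shift_module_params(64, 64)` stays at 4096 regardless of how far the shifts reach. The modules described in the abstract also expose a hyperparameter trading accuracy for efficiency, which this sketch does not model.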
