Abstract: The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by simple curves over which training and test accuracy are nearly constant. We introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, we also propose a new ensembling method entitled Fast Geometric Ensembling (FGE). Using FGE we can train high-performing ensembles in the time required to train a single model. We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles on CIFAR-10, CIFAR-100, and ImageNet.
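To make the ensembling idea concrete, below is a minimal sketch of an FGE-style procedure in PyTorch: starting from a trained network, run SGD with a cyclical learning rate, snapshot the weights at the low-learning-rate point of each cycle, and average the softmax predictions of the snapshots at test time. The function names, the linear within-cycle schedule, and all hyperparameters here are illustrative assumptions, not the authors' exact configuration.

```python
import copy
import torch
import torch.nn.functional as F

def collect_fge_snapshots(model, train_loader, loss_fn,
                          cycles=6, epochs_per_cycle=2,
                          lr_max=0.05, lr_min=0.0005):
    """Fine-tune `model` with a cyclical learning rate and collect one
    weight snapshot at the end (low-lr point) of each cycle."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_max, momentum=0.9)
    steps_per_cycle = epochs_per_cycle * len(train_loader)
    snapshots, step = [], 0
    for _ in range(cycles):
        for _ in range(epochs_per_cycle):
            for x, y in train_loader:
                # Linearly anneal the learning rate from lr_max to lr_min
                # over the course of each cycle (illustrative schedule).
                t = (step % steps_per_cycle) / steps_per_cycle
                for group in optimizer.param_groups:
                    group["lr"] = (1 - t) * lr_max + t * lr_min
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
                step += 1
        # Snapshot the weights where the learning rate is smallest.
        snapshots.append(copy.deepcopy(model).eval())
    return snapshots

def ensemble_predict(snapshots, x):
    """Average the softmax outputs of the collected snapshots."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=1) for m in snapshots])
    return probs.mean(dim=0)
```

Because every snapshot is obtained by continuing training of a single model, the total cost is close to that of training one network, which is what allows FGE to produce an ensemble "in the time required to train a single model."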
Abstract: Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
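The core of SWA is a running average of the weights visited by SGD, rather than an average of predictions. Below is a minimal sketch of that idea in PyTorch; the function name, the constant learning rate, and the hyperparameters (`epochs`, `swa_start`, `lr`) are placeholder assumptions used only to illustrate the weight-averaging update.

```python
import copy
import torch

def train_swa(model, train_loader, loss_fn, epochs=100, swa_start=75, lr=0.05):
    """Run SGD with a constant learning rate, and maintain a running average
    of the weights collected once per epoch after `swa_start`."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    swa_model, n_averaged = None, 0

    for epoch in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

        if epoch >= swa_start:
            if swa_model is None:
                swa_model = copy.deepcopy(model)
            else:
                # Running average of weights: w_swa <- (n * w_swa + w) / (n + 1)
                for p_swa, p in zip(swa_model.parameters(), model.parameters()):
                    p_swa.data.mul_(n_averaged / (n_averaged + 1))
                    p_swa.data.add_(p.data / (n_averaged + 1))
            n_averaged += 1

    # Batch-norm statistics of swa_model should be recomputed with a forward
    # pass over the training data before evaluation.
    return swa_model
```

Because only a single extra copy of the weights is maintained and inference uses one averaged model, the computational and memory overhead over standard SGD training is negligible. Recent PyTorch releases also ship ready-made utilities for this procedure in `torch.optim.swa_utils` (`AveragedModel`, `SWALR`, and `update_bn`).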