Achraf Bahamou

Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning

May 23, 2023

Achraf Bahamou, Donald Goldfarb

Abstract: We propose a new per-layer adaptive step-size procedure for stochastic first-order optimization methods for minimizing empirical loss functions in deep learning, eliminating the need for the user to tune the learning rate (LR). The proposed approach exploits the layer-wise stochastic curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer. The method has memory requirements that are comparable to those of first-order methods, while its per-iteration time complexity is only increased by an amount that is roughly equivalent to an additional gradient computation. Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules and outperform fine-tuned LR versions of these methods as well as popular first-order and second-order algorithms for training DNNs on Autoencoder, Convolutional Neural Network (CNN) and Graph Convolutional Network (GCN) models. Finally, it is proved that an idealized version of SGD with the layer-wise step sizes converges linearly when using full-batch gradients.
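To make the idea of a curvature-based per-layer step size concrete, here is a minimal sketch on a toy quadratic loss. It is not the paper's algorithm: the step-size formula (exact line search along each layer's gradient, using that layer's diagonal Hessian block), the helper names, and the toy matrices are all assumptions chosen for illustration. In a real network the product of a Hessian block with a vector would come from automatic differentiation rather than an explicit matrix.

```python
def matvec(A, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(blocks, weights):
    """Toy loss: sum over layers of 0.5 * w^T A w (block-diagonal Hessian)."""
    return sum(0.5 * dot(w, matvec(A, w)) for A, w in zip(blocks, weights))


def layerwise_step(blocks, weights, eps=1e-12):
    """One gradient step with a separate curvature-based step size per layer.

    Assumed rule (illustrative only): eta_l = (g_l . g_l) / (g_l . H_ll g_l).
    """
    new_weights, step_sizes = [], []
    for A, w in zip(blocks, weights):
        g = matvec(A, w)        # layer gradient of 0.5 * w^T A w
        Hg = matvec(A, g)       # Hessian-block-vector product for this layer
        curvature = dot(g, Hg)
        eta = dot(g, g) / curvature if curvature > eps else 0.0
        step_sizes.append(eta)
        new_weights.append([wi - eta * gi for wi, gi in zip(w, g)])
    return new_weights, step_sizes


# Two hypothetical "layers" whose curvature differs by a factor of 100.
blocks = [[[100.0, 0.0], [0.0, 50.0]], [[1.0, 0.0], [0.0, 0.5]]]
weights = [[1.0, 1.0], [1.0, 1.0]]

new_weights, step_sizes = layerwise_step(blocks, weights)
loss_before = loss(blocks, weights)
loss_after = loss(blocks, new_weights)
```

In this toy setup the high-curvature layer receives a much smaller step than the low-curvature one, which is the behavior a single hand-tuned learning rate cannot provide.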

A Mini-Block Natural Gradient Method for Deep Neural Networks

Feb 16, 2022

Achraf Bahamou, Donald Goldfarb, Yi Ren

Abstract: The training of deep neural networks (DNNs) is currently predominantly done using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs, by preconditioning the stochastic gradient by layer-wise block-diagonal matrices. Here we propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF), that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is also block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than it is for first-order methods. Finally, the performance of our proposed method is compared to that of several baseline methods, on both Auto-encoder and CNN problems, to validate its effectiveness both in terms of time efficiency and generalization power.
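The block-diagonal preconditioning described above can be sketched in a few lines. This is an illustration under stated assumptions, not the MBF implementation: the mini-blocks here are simply consecutive pairs of parameters, each block is an empirical Fisher estimate built from per-sample gradients, the damping value is arbitrary, and the function names are hypothetical. The abstract does not specify how the paper groups parameters into mini-blocks.

```python
BLOCK = 2  # assumed mini-block size, chosen so a closed-form 2x2 solve applies


def solve_2x2(M, b):
    """Solve M x = b for a 2x2 matrix M using the explicit inverse."""
    (a, c), (d, e) = M
    det = a * e - c * d
    return [(e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


def mini_block_direction(per_sample_grads, damping=1e-3):
    """Precondition the mean gradient by damped 2x2 empirical-Fisher mini-blocks.

    per_sample_grads: list of flat gradient vectors, one per sample.
    The parameter dimension must be a multiple of BLOCK.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    mean_grad = [sum(g[i] for g in per_sample_grads) / n for i in range(dim)]
    direction = []
    for start in range(0, dim, BLOCK):
        idx = range(start, start + BLOCK)
        # Mini-block of the empirical Fisher: average of g g^T, plus damping * I.
        F = [
            [
                sum(g[i] * g[j] for g in per_sample_grads) / n
                + (damping if i == j else 0.0)
                for j in idx
            ]
            for i in idx
        ]
        direction.extend(solve_2x2(F, [mean_grad[i] for i in idx]))
    return direction


# Hypothetical per-sample gradients for a 4-parameter layer.
grads = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
direction = mini_block_direction(grads)
```

Because every mini-block is solved independently, the loop over blocks is the part that a GPU implementation would batch into one parallel operation.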

Practical Quasi-Newton Methods for Training Deep Neural Networks

Jun 16, 2020

Donald Goldfarb, Yi Ren, Achraf Bahamou

Abstract: We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
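A short worked calculation may help show why the Kronecker-factored form is so much cheaper than a full layer block. The symbols $d_i$, $d_o$, $W$, $G$, $A$ and $B$ below are introduced here for illustration and are not taken from the abstract; the identity is the standard one for the column-stacking $\mathrm{vec}$ operator and invertible factors.

```latex
% Layer with weight matrix W in R^{d_o x d_i}, gradient G of the same shape.
% Full Hessian block versus Kronecker-factored storage:
\[
  \underbrace{(d_i d_o)^2}_{\text{full block}}
  \quad\text{vs.}\quad
  \underbrace{d_i^2 + d_o^2}_{A \,\in\, \mathbb{R}^{d_i \times d_i},\;
                              B \,\in\, \mathbb{R}^{d_o \times d_o}}
\]
% Example: d_i = d_o = 1000 gives 10^{12} entries versus 2 x 10^{6}.
%
% Applying the inverse never requires forming A \otimes B explicitly:
\[
  (A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}\!\left(B X A^{\top}\right)
  \;\;\Longrightarrow\;\;
  (A \otimes B)^{-1}\,\mathrm{vec}(G) = \mathrm{vec}\!\left(B^{-1} G A^{-\top}\right).
\]
```

Under these assumptions the preconditioned step for a layer costs two small matrix solves instead of one solve with a $(d_i d_o) \times (d_i d_o)$ matrix.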

Stochastic Flows and Geometric Optimization on the Orthogonal Group

Mar 30, 2020

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist(+5 more)

Abstract: We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group $O(d)$ and naturally reductive homogeneous manifolds obtained from the action of the rotation group $SO(d)$. We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult $\mathrm{Humanoid}$ agent from $\mathrm{OpenAI}$ $\mathrm{Gym}$ and improving convolutional neural networks.

A Dynamic Sampling Adaptive-SGD Method for Machine Learning

Dec 31, 2019

Achraf Bahamou, Donald Goldfarb

Abstract: We propose a stochastic optimization method for minimizing loss functions, which can be expressed as an expected value, that adaptively controls the batch size used in the computation of gradient approximations and the step size used to move along such directions, eliminating the need for the user to tune the learning rate. The proposed method exploits local curvature information and ensures that search directions are descent directions with high probability using an acute-angle test. The method is proved to have, under reasonable assumptions, a global linear rate of convergence on self-concordant functions with high probability. Numerical experiments show that this method is able to choose the best learning rates and compares favorably to fine-tuned SGD for training logistic regression and Deep Neural Networks (DNNs). We also propose an adaptive version of ADAM that eliminates the need to tune the base learning rate and compares favorably to fine-tuned ADAM for training DNNs.
