Abstract:While stochastic gradient descent (SGD) with momentum has fast convergence and excellent generalizability, a theoretical explanation for this is lacking. In this paper, we show that SGD with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. This theoretical finding reveals why momentum improves generalizability and provides new insights into the role of the hyperparameters, including momentum factor. We also present an implicit graduated optimization algorithm that exploits the smoothing properties of SGD with momentum and provide experimental results supporting our assertion that SGD with momentum smooths the objective function.
Abstract:The graduated optimization approach is a heuristic method for finding globally optimal solutions for nonconvex functions and has been theoretically analyzed in several studies. This paper defines a new family of nonconvex functions for graduated optimization, discusses their sufficient conditions, and provides a convergence analysis of the graduated optimization algorithm for them. It shows that stochastic gradient descent (SGD) with mini-batch stochastic gradients has the effect of smoothing the function, the degree of which is determined by the learning rate and batch size. This finding provides theoretical insights on why large batch sizes fall into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates and batch sizes, and what the optimal learning rate scheduling is. To the best of our knowledge, this is the first paper to provide a theoretical explanation for these aspects. Moreover, a new graduated optimization framework that uses a decaying learning rate and increasing batch size is analyzed and experimental results of image classification that support our theoretical findings are reported.
Abstract:Previous numerical results have shown that a two time-scale update rule (TTUR) using constant learning rates is practically useful for training generative adversarial networks (GANs). Meanwhile, a theoretical analysis of TTUR to find a stationary local Nash equilibrium of a Nash equilibrium problem with two players, a discriminator and a generator, has been given using decaying learning rates. In this paper, we give a theoretical analysis of TTUR using constant learning rates to bridge the gap between theory and practice. In particular, we show that, for TTUR using constant learning rates, the number of steps needed to find a stationary local Nash equilibrium decreases as the batch size increases. We also provide numerical results to support our theoretical analyzes.