Naoki Sato

Analysis of Muon's Convergence and Critical Batch Size

Jul 02, 2025

Naoki Sato, Hiroki Naganuma, Hideaki Iiduka

Abstract: This paper presents a theoretical analysis of Muon, a new optimizer that leverages the inherent matrix structure of neural network parameters. We provide convergence proofs for four practical variants of Muon: with and without Nesterov momentum, and with and without weight decay. We then show that adding weight decay leads to strictly tighter bounds on both the parameter and gradient norms, and we clarify the relationship between the weight decay coefficient and the learning rate. Finally, we derive Muon's critical batch size minimizing the stochastic first-order oracle (SFO) complexity, which is the stochastic computational cost, and validate our theoretical findings with experiments.
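
The notion of a critical batch size can be illustrated with a small sketch. SFO complexity is the number of training steps multiplied by the batch size, and the critical batch size is the batch size that minimizes this product. The step-count model and the constants c1, c2, and eps_sq below are hypothetical, chosen only for illustration; they are not the bound derived in the paper.

```python
def sfo_complexity(batch_size, c1=1.0, c2=1.0, eps_sq=0.01):
    """SFO complexity = (steps needed to reach the target) x (batch size).

    Hypothetical model: steps K(b) = c1 * b / (eps_sq * b - c2), valid only
    when eps_sq * b > c2. The constants are illustrative, not from the paper.
    """
    margin = eps_sq * batch_size - c2
    if margin <= 0:
        return float("inf")  # target precision is unreachable at this batch size
    steps = c1 * batch_size / margin
    return steps * batch_size


def critical_batch_size(candidates, **model):
    """Return the candidate batch size with the smallest SFO complexity."""
    return min(candidates, key=lambda b: sfo_complexity(b, **model))


best = critical_batch_size(range(1, 1025))
print(best, sfo_complexity(best))
```

Under this assumed model, setting the derivative of b * K(b) to zero gives the minimizer 2 * c2 / eps_sq, which is 200 for the default constants. Below that value extra samples reduce the step count enough to pay for themselves; above it they do not.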

Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks

Dec 16, 2024

Naoki Sato, Koshiro Izumi, Hideaki Iiduka

Abstract: A scaled conjugate gradient method that accelerates existing adaptive methods utilizing stochastic gradients is proposed for solving nonconvex optimization problems with deep neural networks. It is shown theoretically that, whether with constant or diminishing learning rates, the proposed method can obtain a stationary point of the problem. Additionally, its rate of convergence with diminishing learning rates is verified to be superior to that of the conjugate gradient method. The proposed method is shown to minimize training loss functions faster than the existing adaptive methods in practical applications of image and text classification. Furthermore, in the training of generative adversarial networks, one version of the proposed method achieved the lowest Fréchet inception distance score among those of the adaptive methods.

* Accepted at JMLR (Dec. 2024)

Explicit and Implicit Graduated Optimization in Deep Neural Networks

Dec 16, 2024

Naoki Sato, Hideaki Iiduka

Abstract: Graduated optimization is a global optimization technique that is used to minimize a multimodal nonconvex function by smoothing the objective function with noise and gradually refining the solution. This paper experimentally evaluates the performance of the explicit graduated optimization algorithm with an optimal noise scheduling derived from a previous study and discusses its limitations. The evaluation uses traditional benchmark functions and empirical loss functions for modern neural network architectures. In addition, this paper extends the implicit graduated optimization algorithm, which is based on the fact that stochastic noise in the optimization process of SGD implicitly smooths the objective function, to SGD with momentum, analyzes its convergence, and demonstrates its effectiveness through experiments on image classification tasks with ResNet architectures.

* Accepted at AAAI-25
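
The idea of explicit graduated optimization described in this abstract can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the objective, the constants A and W, the learning rate, and the noise schedule (1.0, 0.5, 0.25, 0.0) are hypothetical and are not the optimal schedule studied in the paper. The toy objective is chosen so that Gaussian smoothing has a closed form, which keeps the sketch deterministic.

```python
import math

A, W = 2.0, 5.0  # toy constants (assumptions): ripple amplitude and frequency


def smoothed_grad(x, delta):
    """Gradient of the Gaussian-smoothed objective E[f(x + delta * u)], u ~ N(0, 1).

    f(x) = x**2 + A * (1 - cos(W * x)) has its global minimum at x = 0 and
    many local minima. For this f, smoothing damps the cosine term by
    exp(-(W * delta)**2 / 2); delta = 0 recovers the original gradient.
    """
    damping = math.exp(-0.5 * (W * delta) ** 2)
    return 2.0 * x + A * W * math.sin(W * x) * damping


def gradient_descent(x, delta, lr=0.01, steps=200):
    """Plain gradient descent on the objective smoothed at noise level delta."""
    for _ in range(steps):
        x -= lr * smoothed_grad(x, delta)
    return x


def graduated_optimization(x, noise_levels=(1.0, 0.5, 0.25, 0.0)):
    """Minimize progressively less-smoothed objectives, warm-starting each stage."""
    for delta in noise_levels:
        x = gradient_descent(x, delta)
    return x


x_plain = gradient_descent(3.0, delta=0.0, steps=800)  # no smoothing at all
x_graduated = graduated_optimization(3.0)
print(x_plain, x_graduated)
```

Started from x = 3.0, gradient descent on the unsmoothed objective settles in a nearby local minimum, while the graduated run first follows the heavily smoothed, nearly quadratic objective toward the origin and then refines the solution as the noise level shrinks.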

Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization

Feb 04, 2024

Naoki Sato, Hideaki Iiduka

Abstract: While stochastic gradient descent (SGD) with momentum has fast convergence and excellent generalizability, a theoretical explanation for this is lacking. In this paper, we show that SGD with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. This theoretical finding reveals why momentum improves generalizability and provides new insights into the role of the hyperparameters, including momentum factor. We also present an implicit graduated optimization algorithm that exploits the smoothing properties of SGD with momentum and provide experimental results supporting our assertion that SGD with momentum smooths the objective function.

Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling

Nov 29, 2023

Naoki Sato, Hideaki Iiduka

Abstract: The graduated optimization approach is a heuristic method for finding globally optimal solutions for nonconvex functions and has been theoretically analyzed in several studies. This paper defines a new family of nonconvex functions for graduated optimization, discusses their sufficient conditions, and provides a convergence analysis of the graduated optimization algorithm for them. It shows that stochastic gradient descent (SGD) with mini-batch stochastic gradients has the effect of smoothing the function, the degree of which is determined by the learning rate and batch size. This finding provides theoretical insights on why large batch sizes fall into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates and batch sizes, and what the optimal learning rate scheduling is. To the best of our knowledge, this is the first paper to provide a theoretical explanation for these aspects. Moreover, a new graduated optimization framework that uses a decaying learning rate and increasing batch size is analyzed and experimental results of image classification that support our theoretical findings are reported.

* The latest version was updated on Nov. 29

Using Constant Learning Rate of Two Time-Scale Update Rule for Training Generative Adversarial Networks

Jan 28, 2022

Naoki Sato, Hideaki Iiduka

Abstract: Previous numerical results have shown that a two time-scale update rule (TTUR) using constant learning rates is practically useful for training generative adversarial networks (GANs). Meanwhile, a theoretical analysis of TTUR to find a stationary local Nash equilibrium of a Nash equilibrium problem with two players, a discriminator and a generator, has been given using decaying learning rates. In this paper, we give a theoretical analysis of TTUR using constant learning rates to bridge the gap between theory and practice. In particular, we show that, for TTUR using constant learning rates, the number of steps needed to find a stationary local Nash equilibrium decreases as the batch size increases. We also provide numerical results to support our theoretical analyses.
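
As a rough illustration of a two time-scale update rule with constant learning rates, the sketch below runs simultaneous gradient descent-ascent on a toy two-player problem. The quadratic objective, the learning rates lr_gen and lr_disc, and the step count are assumptions made for this example. Full gradients are used, so the mini-batch noise and the batch-size dependence analyzed in the paper are not modeled.

```python
def ttur(theta, w, lr_gen=0.01, lr_disc=0.04, steps=1000):
    """Simultaneous gradient descent-ascent with two constant learning rates.

    Toy objective (an assumption): f(theta, w) = 0.5*theta**2 + theta*w - 0.5*w**2.
    The "generator" theta minimizes f and the "discriminator" w maximizes it;
    the unique stationary point, which is also a Nash equilibrium, is (0, 0).
    """
    for _ in range(steps):
        grad_theta = theta + w  # partial derivative of f with respect to theta
        grad_w = theta - w      # partial derivative of f with respect to w
        # Each player moves with its own constant learning rate.
        theta, w = theta - lr_gen * grad_theta, w + lr_disc * grad_w
    return theta, w


theta_final, w_final = ttur(theta=1.0, w=-1.0)
print(theta_final, w_final)
```

With these rates the iterates spiral into the equilibrium at the origin. The discriminator is given the larger learning rate here only as a common convention, not as a recommendation drawn from the abstract.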
