Damien Scieur

DMA, CIMS

Understanding Adam Requires Better Rotation Dependent Assumptions

Oct 25, 2024

Lucas Maes, Tianyue H. Zhang, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret

Abstract: Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.
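
As a rough illustration of the basis sensitivity described in this abstract, the following sketch compares plain gradient descent and Adam on a toy 2-D quadratic, run once in the original basis and once in a rotated basis. The objective, the helper names (`rotate`, `rotated_grad`, `basis_gap`), and all hyperparameters are assumptions chosen for illustration; this is not the paper's transformer experiment.

```python
import math

def rotate(v, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (100 * x0**2 + x1**2)."""
    return [100.0 * x[0], x[1]]

def rotated_grad(theta):
    """Gradient of g(y) = f(R^T y): the same objective seen in a rotated basis."""
    return lambda y: rotate(grad(rotate(y, -theta)), theta)

def gd(grad_fn, x, lr, steps):
    """Plain gradient descent."""
    for _ in range(steps):
        g = grad_fn(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def adam(grad_fn, x, lr, steps, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update with bias correction (per-coordinate scaling)."""
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        x = [xi - lr * mh / (math.sqrt(vh) + eps)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x

def basis_gap(optimizer, x0, theta, lr, steps):
    """Distance between optimizing in the original basis and optimizing in
    the rotated basis (with the result rotated back for comparison)."""
    direct = optimizer(grad, x0, lr, steps)
    rotated_run = optimizer(rotated_grad(theta), rotate(x0, theta), lr, steps)
    return math.dist(direct, rotate(rotated_run, -theta))

if __name__ == "__main__":
    print("GD gap:  ", basis_gap(gd, [1.0, 1.0], math.pi / 4, 0.005, 10))
    print("Adam gap:", basis_gap(adam, [1.0, 1.0], math.pi / 4, 0.1, 1))
```

Gradient descent produces the same iterates in both bases up to floating-point error, whereas Adam's per-coordinate normalization makes its trajectory depend on the basis.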

SING: A Plug-and-Play DNN Learning Technique

May 25, 2023

Adrien Courtois, Damien Scieur, Jean-Michel Morel, Pablo Arias, Thomas Eboli

Abstract: We propose SING (StabIlized and Normalized Gradient), a plug-and-play technique that improves the stability and generalization of the Adam(W) optimizer. SING is straightforward to implement and has minimal computational overhead, requiring only a layer-wise standardization of the gradients fed to Adam(W) without introducing additional hyper-parameters. We support the effectiveness and practicality of the proposed approach by showing improved results on a wide range of architectures, problems (such as image classification, depth estimation, and natural language processing), and in combination with other optimizers. We provide a theoretical analysis of the convergence of the method, and we show that by virtue of the standardization, SING can escape local minima narrower than a threshold that is inversely proportional to the network's depth.
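
One plausible reading of "layer-wise standardization of the gradients" is to center each layer's gradient and divide it by its standard deviation before handing it to the optimizer. The sketch below follows that reading; the function names, the layer names, and the `eps` stabilizer are hypothetical, and the paper's exact operator may differ.

```python
import math

def standardize_layer(grad, eps=1e-8):
    """Center one layer's gradient and scale it to (roughly) unit standard
    deviation. `eps` is an assumed stabilizer for near-constant gradients."""
    mean = sum(grad) / len(grad)
    centered = [g - mean for g in grad]
    std = math.sqrt(sum(c * c for c in centered) / len(grad))
    return [c / (std + eps) for c in centered]

def standardize_gradients(grads_by_layer):
    """Apply the standardization independently to every layer.
    The result would then be passed to Adam(W) in place of the raw gradients."""
    return {name: standardize_layer(g) for name, g in grads_by_layer.items()}

if __name__ == "__main__":
    # Hypothetical layer names and gradient values with very different scales.
    raw = {"fc1": [1.0, 2.0, 3.0, 4.0], "fc2": [1000.0, -1000.0]}
    print(standardize_gradients(raw))
```

After this step every layer's gradient has the same scale regardless of its raw magnitude, which is the property the abstract attributes to the standardization.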

The Curse of Unrolling: Rate of Differentiating Through Optimization

Sep 27, 2022

Damien Scieur, Quentin Bertrand, Gauthier Gidel, Fabian Pedregosa

Abstract: Computing the Jacobian of the solution of an optimization problem is a central problem in machine learning, with applications in hyperparameter optimization, meta-learning, optimization as a layer, and dataset distillation, to name a few. Unrolled differentiation is a popular heuristic that approximates the solution using an iterative solver and differentiates it through the computational path. This work provides a non-asymptotic convergence-rate analysis of this approach on quadratic objectives for gradient descent and the Chebyshev method. We show that to ensure convergence of the Jacobian, we can either 1) choose a large learning rate leading to a fast asymptotic convergence but accept that the algorithm may have an arbitrarily long burn-in phase or 2) choose a smaller learning rate leading to an immediate but slower convergence. We refer to this phenomenon as the curse of unrolling. Finally, we discuss open problems relative to this approach, such as deriving a practical update rule for the optimal unrolling strategy and making novel connections with the field of Sobolev orthogonal polynomials.
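
The burn-in phase mentioned in this abstract can be seen on a toy scalar problem. The sketch below assumes the objective f(x; theta) = 0.5 * theta * x**2 - b * x, whose minimizer is x*(theta) = b / theta with derivative -b / theta**2, and differentiates the gradient-descent iterates with respect to theta by propagating the derivative alongside the iterate. The objective and the function name are assumptions for illustration, and in this one-dimensional setting the sketch only shows the burn-in hump, not the full rate trade-off analyzed in the paper.

```python
def unrolled_jacobian_errors(theta, b, lr, steps):
    """Run gradient descent on f(x; theta) = 0.5*theta*x**2 - b*x from x = 0
    and track |d x_t / d theta - d x* / d theta| at every iteration."""
    x, jac = 0.0, 0.0
    exact = -b / theta ** 2  # derivative of the exact minimizer b / theta
    errors = []
    for _ in range(steps):
        # Differentiate the update x <- x - lr * (theta * x - b) w.r.t. theta,
        # using the iterate from before the update.
        jac = jac - lr * (x + theta * jac)
        x = x - lr * (theta * x - b)
        errors.append(abs(jac - exact))
    return errors

if __name__ == "__main__":
    large = unrolled_jacobian_errors(theta=1.0, b=1.0, lr=1.9, steps=200)
    small = unrolled_jacobian_errors(theta=1.0, b=1.0, lr=0.5, steps=200)
    print("large step: first error, peak error =", large[0], max(large))
    print("small step: first error, peak error =", small[0], max(small))
```

With the larger step size the Jacobian error first grows well above its initial value before it decays, while with the smaller step size it never increases.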

Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime

Jun 22, 2022

Leonardo Cunha, Gauthier Gidel, Fabian Pedregosa, Damien Scieur, Courtney Paquette

Abstract: The recently developed average-case analysis of optimization methods allows a more fine-grained and representative convergence analysis than usual worst-case results. In exchange, this analysis requires a more precise hypothesis over the data generating process, namely assuming knowledge of the expected spectral distribution (ESD) of the random matrix associated with the problem. This work shows that the concentration of eigenvalues near the edges of the ESD determines a problem's asymptotic average complexity. A priori information on this concentration is a more grounded assumption than complete knowledge of the ESD. This approximate concentration is effectively a middle ground between the coarseness of the worst-case scenario convergence and the restrictive previous average-case analysis. We also introduce the Generalized Chebyshev method, asymptotically optimal under a hypothesis on this concentration and globally optimal when the ESD follows a Beta distribution. We compare its performance to classical optimization algorithms, such as gradient descent or Nesterov's scheme, and we show that, in the average-case context, Nesterov's method is universally nearly optimal asymptotically.

* To be published in ICML 2022

Convergence Rates for the MAP of an Exponential Family and Stochastic Mirror Descent -- an Open Problem

Nov 12, 2021

Rémi Le Priol, Frederik Kunstner, Damien Scieur, Simon Lacoste-Julien

Abstract: We consider the problem of upper bounding the expected log-likelihood sub-optimality of the maximum likelihood estimate (MLE), or a conjugate maximum a posteriori (MAP) for an exponential family, in a non-asymptotic way. Surprisingly, we found no general solution to this problem in the literature. In particular, current theories do not hold for a Gaussian or in the interesting few-sample regime. After exhibiting various facets of the problem, we show we can interpret the MAP as running stochastic mirror descent (SMD) on the log-likelihood. However, modern convergence results do not apply for standard examples of the exponential family, highlighting holes in the convergence literature. We believe solving this very fundamental problem may bring progress to both the statistics and optimization communities.

* 9 pages and 3 figures + Appendix

Connecting Sphere Manifolds Hierarchically for Regularization

Jun 25, 2021

Damien Scieur, Youngsung Kim

Abstract: This paper considers classification problems with hierarchically organized classes. We force the classifier (hyperplane) of each class to belong to a sphere manifold, whose center is the classifier of its super-class. Then, individual sphere manifolds are connected based on their hierarchical relations. Our technique replaces the last layer of a neural network by combining a spherical fully-connected layer with a hierarchical layer. This regularization is shown to improve the performance of widely used deep neural network architectures (ResNet and DenseNet) on publicly available datasets (CIFAR100, CUB200, Stanford dogs, Stanford cars, and Tiny-ImageNet).

Acceleration Methods

Jan 23, 2021

Alexandre d'Aspremont, Damien Scieur, Adrien Taylor

Abstract: This monograph covers some recent advances on a range of acceleration techniques frequently used in convex optimization. We first use quadratic optimization problems to introduce two key families of methods, momentum and nested optimization schemes, which coincide in the quadratic case to form the Chebyshev method whose complexity is analyzed using Chebyshev polynomials. We discuss momentum methods in detail, starting with the seminal work of Nesterov (1983) and structure convergence proofs using a few master templates, such as that of optimized gradient methods which have the key benefit of showing how momentum methods maximize convergence rates. We further cover proximal acceleration techniques, at the heart of the Catalyst and Accelerated Hybrid Proximal Extragradient frameworks, using similar algorithmic patterns. Common acceleration techniques directly rely on the knowledge of some regularity parameters of the problem at hand, and we conclude by discussing restart schemes, a set of simple techniques to reach nearly optimal convergence rates while adapting to unobserved regularity parameters.
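
To make the effect of momentum on quadratics concrete, the sketch below compares gradient descent with Polyak's heavy-ball method on a diagonal quadratic f(x) = 0.5 * sum(h_i * x_i**2). Heavy ball is used here as a simple stand-in with its classical tuning from the smallest and largest eigenvalues; it is not the monograph's Chebyshev recursion, and the eigenvalues, starting point, and function names are assumptions for illustration.

```python
import math

def final_error(eigs, lr, momentum, steps):
    """Iterate x <- x - lr * H x + momentum * (x - x_prev) on the diagonal
    quadratic with Hessian eigenvalues `eigs`, starting from the all-ones
    vector, and return the distance to the minimizer (the origin)."""
    x = [1.0] * len(eigs)
    prev = list(x)
    for _ in range(steps):
        new = [xi - lr * h * xi + momentum * (xi - pi)
               for h, xi, pi in zip(eigs, x, prev)]
        prev, x = x, new
    return math.sqrt(sum(xi * xi for xi in x))

def compare(eigs, steps):
    """Return (gradient descent error, heavy-ball error) after `steps` steps,
    both tuned from the extreme eigenvalues mu and L."""
    mu, big_l = min(eigs), max(eigs)
    gd_error = final_error(eigs, 2.0 / (big_l + mu), 0.0, steps)
    sq_mu, sq_l = math.sqrt(mu), math.sqrt(big_l)
    hb_lr = 4.0 / (sq_l + sq_mu) ** 2
    hb_momentum = ((sq_l - sq_mu) / (sq_l + sq_mu)) ** 2
    hb_error = final_error(eigs, hb_lr, hb_momentum, steps)
    return gd_error, hb_error

if __name__ == "__main__":
    gd_error, hb_error = compare([1.0, 10.0, 50.0, 100.0], steps=100)
    print("gradient descent:", gd_error)
    print("heavy ball:      ", hb_error)
```

Both tunings require the extreme eigenvalues, which is the kind of regularity knowledge the abstract says common acceleration techniques rely on.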

Average-case Acceleration for Bilinear Games and Normal Matrices

Oct 05, 2020

Carles Domingo-Enrich, Fabian Pedregosa, Damien Scieur

Abstract: Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speed-up compared to worst-case optimal algorithms. We illustrate our findings through benchmarks with a varying degree of mismatch with our assumptions.

* 24 pages, 1 figure

Average-case Acceleration Through Spectral Density Estimation

Feb 13, 2020

Fabian Pedregosa, Damien Scieur

Abstract: We develop a framework for designing optimal quadratic optimization methods in terms of their average-case runtime. This yields a new class of methods that achieve acceleration through a model of the Hessian's expected spectral density. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based gradient algorithms whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Empirical results on quadratic problems, logistic regression, and neural networks show the proposed methods always match and in many cases significantly improve over classical accelerated methods.

Accelerating Smooth Games by Manipulating Spectral Shapes

Jan 02, 2020

Waïss Azizian, Damien Scieur, Ioannis Mitliagkas, Simon Lacoste-Julien, Gauthier Gidel

Abstract: We use matrix iteration theory to characterize acceleration in smooth games. We define the spectral shape of a family of games as the set containing all eigenvalues of the Jacobians of standard gradient dynamics in the family. Shapes restricted to the real line represent well-understood classes of problems, like minimization. Shapes spanning the complex plane capture the added numerical challenges in solving smooth games. In this framework, we describe gradient-based methods, such as extragradient, as transformations on the spectral shape. Using this perspective, we propose an optimal algorithm for bilinear games. For smooth and strongly monotone operators, we identify a continuum between convex minimization, where acceleration is possible using Polyak's momentum, and the worst case where gradient descent is optimal. Finally, going beyond first-order methods, we propose an accelerated version of consensus optimization.
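
The numerical difficulty of purely imaginary spectra can be seen on the simplest bilinear game, min over x, max over y of x * y, whose gradient dynamics have Jacobian eigenvalues +i and -i. The sketch below is a minimal illustration under that assumed toy game and an assumed step size; it contrasts simultaneous gradient descent-ascent with extragradient and is not the optimal algorithm proposed in the paper.

```python
import math

def gradient_descent_ascent(x, y, lr, steps):
    """Simultaneous updates on min_x max_y x*y: x descends, y ascends.
    Returns the final distance to the equilibrium (0, 0)."""
    for _ in range(steps):
        x, y = x - lr * y, y + lr * x
    return math.hypot(x, y)

def extragradient(x, y, lr, steps):
    """Extragradient: take a look-ahead step, then update the original point
    using the gradient evaluated at the look-ahead point."""
    for _ in range(steps):
        x_mid, y_mid = x - lr * y, y + lr * x
        x, y = x - lr * y_mid, y + lr * x_mid
    return math.hypot(x, y)

if __name__ == "__main__":
    print("gradient descent-ascent:", gradient_descent_ascent(1.0, 1.0, 0.5, 100))
    print("extragradient:          ", extragradient(1.0, 1.0, 0.5, 100))
```

Each descent-ascent step multiplies the distance to the equilibrium by sqrt(1 + lr**2), so the iterates spiral outward, while each extragradient step multiplies it by sqrt(1 - lr**2 + lr**4), which is below 1 for any step size between 0 and 1.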
