Scott Pesme

A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation

May 26, 2025

Etienne Boursier, Scott Pesme, Radu-Alexandru Dragomir

Abstract: We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the $\textit{grokking}$ effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks.
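
The two-phase behaviour can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes the loss $F(u, v) = \frac{1}{2}(uv - 1)^2$, whose minimisers form the curve $uv = 1$ with minimum-norm point $(1, 1)$, and it uses small-step gradient descent as a stand-in for gradient flow. The function name `run` and all default values are illustrative choices.

```python
# Toy illustration (an assumption, not the paper's experiment):
# F(u, v) = 0.5 * (u*v - 1)^2 has a manifold of minimisers {u*v = 1};
# the minimum-norm point on that manifold is (1, 1).

def run(lam=0.01, step=0.01, n_steps=50_000, u=3.0, v=0.1):
    """Gradient descent on F(u, v) + (lam / 2) * (u^2 + v^2).

    Returns a log of (time, unregularised loss, squared norm) and the final point.
    """
    history = []
    for k in range(n_steps + 1):
        residual = u * v - 1.0
        if k % 500 == 0:
            history.append((k * step, 0.5 * residual**2, u * u + v * v))
        # Gradient of the loss plus the weight-decay term.
        grad_u = residual * v + lam * u
        grad_v = residual * u + lam * v
        u, v = u - step * grad_u, v - step * grad_v
    return history, (u, v)

history, (u_final, v_final) = run()
for t, loss, sq_norm in history[::10]:
    print(f"t={t:7.1f}  loss={loss:.2e}  squared norm={sq_norm:.3f}")
```

With these arbitrary defaults the loss should already be tiny at $t = 5$, while the squared norm only drifts towards its minimal value of about 2 on a timescale of order $1/\lambda = 100$.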

Implicit Bias of Mirror Flow on Separable Data

Jun 18, 2024

Scott Pesme, Radu-Alexandru Dragomir, Nicolas Flammarion

Abstract: We examine the continuous-time counterpart of mirror descent, namely mirror flow, on classification problems which are linearly separable. Such problems are minimised 'at infinity' and have many possible solutions; we study which solution is preferred by the algorithm depending on the mirror potential. For exponential tailed losses and under mild assumptions on the potential, we show that the iterates converge in direction towards a $\phi_\infty$-maximum margin classifier. The function $\phi_\infty$ is the $\textit{horizon function}$ of the mirror potential and characterises its shape 'at infinity'. When the potential is separable, a simple formula allows this function to be computed. We analyse several examples of potentials and provide numerical experiments highlighting our results.
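
For orientation, mirror flow with potential $\phi$ on a loss $L$ is usually written as the ODE below. The symbols $\beta_t$ and $L$ are notation assumed here; the abstract does not fix them.

$$\frac{\mathrm{d}}{\mathrm{d}t}\,\nabla\phi(\beta_t) = -\nabla L(\beta_t)$$

Taking $\phi(\beta) = \frac{1}{2}\|\beta\|_2^2$ gives $\nabla\phi(\beta) = \beta$ and recovers plain gradient flow.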

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Mar 08, 2024

Hristo Papazov, Scott Pesme, Nicolas Flammarion

Abstract: In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{\gamma}{(1 - \beta)^2}$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
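
As a small worked example of the intrinsic quantity, the sketch below evaluates $\lambda = \gamma / (1 - \beta)^2$ for two hyperparameter pairs. The function name `intrinsic_lambda` and the numerical values are hypothetical; only the formula comes from the abstract.

```python
def intrinsic_lambda(step_size: float, momentum: float) -> float:
    """Return lambda = gamma / (1 - beta)^2, the quantity defined in the abstract."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return step_size / (1.0 - momentum) ** 2

# Two hypothetical (step size, momentum) pairs that share the same lambda.
configs = [(0.01, 0.9), (0.0025, 0.95)]
for gamma, beta in configs:
    print(f"gamma={gamma}, beta={beta}, lambda={intrinsic_lambda(gamma, beta):.6f}")
```

Both pairs give $\lambda \approx 1$, so by the abstract's statement they would be expected to trace the same optimisation path in the continuous-time description.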

Saddle-to-Saddle Dynamics in Diagonal Linear Networks

Apr 02, 2023

Scott Pesme, Nicolas Flammarion

Abstract: In this paper we fully describe the trajectory of gradient flow over diagonal linear networks in the limit of vanishing initialisation. We show that the limiting flow successively jumps from a saddle of the training loss to another until reaching the minimum $\ell_1$-norm solution. This saddle-to-saddle dynamics translates to an incremental learning process as each saddle corresponds to the minimiser of the loss constrained to an active set outside of which the coordinates must be zero. We explicitly characterise the visited saddles as well as the jumping times through a recursive algorithm reminiscent of the Homotopy algorithm used for computing the Lasso path. Our proof leverages a convenient arc-length time-reparametrisation which enables us to keep track of the heteroclinic transitions between the jumps. Our analysis requires negligible assumptions on the data, applies to both under and overparametrised settings and covers complex cases where there is no monotonicity of the number of active coordinates. We provide numerical experiments to support our findings.
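
The bias towards the minimum $\ell_1$-norm solution at small initialisation can be seen on a two-dimensional toy problem. The sketch below makes several assumptions not stated in the abstract: the parametrisation $\beta = u \odot u - v \odot v$, a single observation $x = (1, 2)$, $y = 2$, and small-step gradient descent in place of gradient flow. The function name `train` is illustrative.

```python
# Hypothetical toy problem: one observation x = (1, 2), y = 2, so every
# beta with beta_1 + 2 * beta_2 = 2 interpolates the data.
x = (1.0, 2.0)
y = 2.0

def train(alpha=1e-3, step=0.01, n_steps=5_000):
    """Gradient descent on 0.5 * (<x, u*u - v*v> - y)^2 from u = v = alpha."""
    u = [alpha, alpha]
    v = [alpha, alpha]
    for _ in range(n_steps):
        beta = [ui * ui - vi * vi for ui, vi in zip(u, v)]
        residual = sum(xi * bi for xi, bi in zip(x, beta)) - y
        # Chain rule: dL/du_i = 2 * residual * x_i * u_i,
        #             dL/dv_i = -2 * residual * x_i * v_i.
        u = [ui - step * 2.0 * residual * xi * ui for ui, xi in zip(u, x)]
        v = [vi + step * 2.0 * residual * xi * vi for vi, xi in zip(v, x)]
    return [ui * ui - vi * vi for ui, vi in zip(u, v)]

beta_gd = train()
# Minimum l2-norm interpolator, x * y / ||x||^2, for comparison.
sq_norm_x = sum(xi * xi for xi in x)
beta_l2 = [xi * y / sq_norm_x for xi in x]
print("gradient descent:", beta_gd)
print("min l2 norm     :", beta_l2)
```

On this problem the minimum $\ell_1$-norm interpolator is $(0, 1)$ and the minimum $\ell_2$-norm one is $(0.4, 0.8)$. With a small `alpha` the gradient descent output should sit close to the former, with the second coordinate activating first.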

(S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability

Feb 17, 2023

Mathieu Even, Scott Pesme, Suriya Gunasekar, Nicolas Flammarion

Abstract: In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the "edge of stability" regime. Our findings are supported by experimental results.

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

Jun 17, 2021

Scott Pesme, Loucas Pillaud-Vivien, Nicolas Flammarion

Abstract: Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the better performance of stochastic gradient descent over gradient descent observed in practice.

On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent

Jul 01, 2020

Scott Pesme, Aymeric Dieuleveut, Nicolas Flammarion

Abstract: Constant step-size Stochastic Gradient Descent exhibits two phases: a transient phase during which iterates make fast progress towards the optimum, followed by a stationary phase during which iterates oscillate around the optimal point. In this paper, we show that efficiently detecting this transition and appropriately decreasing the step size can lead to fast convergence rates. We analyse the classical statistical test proposed by Pflug (1983), based on the inner product between consecutive stochastic gradients. Even in the simple case where the objective function is quadratic, we show that this test cannot lead to an adequate convergence diagnostic. We then propose a novel and simple statistical procedure that accurately detects stationarity and we provide experimental results showing state-of-the-art performance on synthetic and real-world datasets.
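
The statistic behind the Pflug-style test, the inner product between consecutive stochastic gradients, can be sketched on a one-dimensional quadratic. The objective $f(\theta) = \theta^2 / 2$ with additive Gaussian gradient noise, the function name `pflug_statistic` and all parameter values are assumptions made for illustration. The sketch only computes the averaged statistic; it is neither the full test nor the new procedure proposed in the paper.

```python
import random

def pflug_statistic(theta0, step, noise_std, n_steps, seed=0):
    """Average of g_k * g_{k+1} along constant-step SGD on f(theta) = theta^2 / 2."""
    rng = random.Random(seed)
    theta = theta0
    prev_grad = None
    total = 0.0
    for _ in range(n_steps):
        grad = theta + rng.gauss(0.0, noise_std)  # noisy gradient of theta^2 / 2
        if prev_grad is not None:
            total += prev_grad * grad
        theta -= step * grad
        prev_grad = grad
    return total / (n_steps - 1)

# Transient regime: far from the optimum with no noise, consecutive gradients align.
transient = pflug_statistic(theta0=10.0, step=0.1, noise_std=0.0, n_steps=50)
# Stationary regime: started at the optimum, so the noise dominates.
stationary = pflug_statistic(theta0=0.0, step=0.5, noise_std=1.0, n_steps=20_000)
print("transient :", transient)
print("stationary:", stationary)
```

The statistic is positive in the noiseless transient run and negative on average in the stationary run. That sign change is what the test relies on, and the abstract argues it is not a reliable diagnostic in general.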

Online Robust Regression via SGD on the l1 loss

Jul 01, 2020

Scott Pesme, Nicolas Flammarion

Abstract: We consider the robust linear regression problem in the online setting where we have access to the data in a streaming manner, one data point after the other. More specifically, for a true parameter $\theta^*$, we consider the corrupted Gaussian linear model $y = \langle x, \theta^* \rangle + \varepsilon + b$ where the adversarial noise $b$ can take any value with probability $\eta$ and equals zero otherwise. We consider this adversary to be oblivious (i.e., $b$ independent of the data) since this is the only contamination model under which consistency is possible. Current algorithms rely on having the whole data at hand in order to identify and remove the outliers. In contrast, we show in this work that stochastic gradient descent on the $\ell_1$ loss converges to the true parameter vector at a $\tilde{O}\big(1 / ((1 - \eta)^2 n)\big)$ rate which is independent of the values of the contaminated measurements. Our proof relies on the elegant smoothing of the non-smooth $\ell_1$ loss by the Gaussian data and a classical non-asymptotic analysis of Polyak-Ruppert averaged SGD. In addition, we provide experimental evidence of the efficiency of this simple and highly scalable algorithm.
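
A minimal sketch of the streaming procedure described above: one subgradient step on the $\ell_1$ loss per data point, with a running Polyak-Ruppert average of the iterates. The step-size schedule $1/\sqrt{k}$, the constant outlier value, the function name `l1_sgd` and the problem sizes are assumptions for illustration and are not taken from the paper.

```python
import math
import random

def l1_sgd(theta_star, eta, n_samples, outlier=1000.0, seed=0):
    """Averaged SGD on |y - <x, theta>| with data streamed from a corrupted linear model."""
    rng = random.Random(seed)
    d = len(theta_star)
    theta = [0.0] * d
    theta_avg = [0.0] * d
    for k in range(1, n_samples + 1):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Oblivious corruption: b is drawn independently of x.
        b = outlier if rng.random() < eta else 0.0
        y = sum(xi * ti for xi, ti in zip(x, theta_star)) + rng.gauss(0.0, 1.0) + b
        residual = y - sum(xi * ti for xi, ti in zip(x, theta))
        sign = (residual > 0) - (residual < 0)
        step = 1.0 / math.sqrt(k)  # assumed schedule, not the paper's
        # Subgradient step: the subgradient of |y - <x, theta>| is -sign * x.
        theta = [ti + step * sign * xi for ti, xi in zip(theta, x)]
        # Running (Polyak-Ruppert) average of the iterates.
        theta_avg = [ai + (ti - ai) / k for ai, ti in zip(theta_avg, theta)]
    return theta_avg

true_theta = [1.0, -2.0]
estimate = l1_sgd(true_theta, eta=0.2, n_samples=20_000)
error = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, true_theta)))
print("estimate:", estimate, "error:", error)
```

Because each update uses only the sign of the residual, the magnitude of a corrupted measurement never enters the iterate. This matches the abstract's claim that the rate does not depend on the contaminated values.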
