Abstract: We study the generalization properties of binary logistic classification in a simplified setting, for which a "memorizing" and a "generalizing" solution can always be strictly defined, and we elucidate empirically and analytically the mechanism underlying grokking in its dynamics. We analyze the asymptotic long-time dynamics of logistic classification on a random feature model with a constant label and show that it exhibits grokking, in the sense of delayed generalization and non-monotonic test loss. We find that grokking is amplified when classification is applied to training sets that are on the verge of linear separability. Even though a perfect generalizing solution always exists, we prove that the implicit bias of the logistic loss causes the model to overfit if the training data are linearly separable from the origin. For training sets that are not separable from the origin, the model always generalizes perfectly asymptotically, but overfitting may occur at early stages of training. Importantly, in the vicinity of the transition, that is, for training sets that are almost separable from the origin, the model may overfit for arbitrarily long times before generalizing. We gain further insight by examining a tractable one-dimensional toy model that quantitatively captures the key features of the full model. Finally, we highlight intriguing commonalities between our findings and recent literature, suggesting that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.
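To make the setting concrete, here is a minimal numerical sketch, not the exact model of the abstract: logistic classification with a trainable bias on Gaussian inputs that all carry the label +1, with the Gaussian inputs standing in for the random features. In this toy version the bias alone realizes a perfectly generalizing solution, while a nonzero weight vector fits the particular sample; choosing n ≈ 2d places the training set near the threshold of linear separability from the origin (Wendel's theorem). All sizes and the learning rate are illustrative choices.

```python
# Minimal sketch (assumptions: Gaussian inputs as stand-ins for the random
# features, all labels +1, a trainable bias, full-batch gradient descent).
import numpy as np

rng = np.random.default_rng(0)
d, n, lr, steps = 30, 60, 0.5, 200_000     # n ~ 2d: near the separability-from-the-origin threshold

X_train = rng.standard_normal((n, d))
X_test = rng.standard_normal((10_000, d))  # test points carry the same constant label +1

w, b = np.zeros(d), 0.0                    # weights and bias of the linear classifier

def logistic_loss(margins):
    return np.mean(np.logaddexp(0.0, -margins))   # mean log(1 + e^{-y f(x)}), with y = +1

for t in range(steps):
    m = X_train @ w + b                            # margins y_i * f(x_i), y_i = +1
    g = -np.exp(-np.logaddexp(0.0, m))             # d(loss)/d(margin) = -1 / (1 + e^{m}), numerically stable
    w -= lr * (X_train.T @ g) / n                  # full-batch gradient descent
    b -= lr * np.mean(g)
    if t % 20_000 == 0:
        print(t, logistic_loss(m), logistic_loss(X_test @ w + b))
```

In this sketch the test loss can rise while the training loss keeps falling, and whether the weight vector eventually dominates the bias depends on whether the sampled training set happens to be separable from the origin.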
Abstract: Grokking is the intriguing phenomenon in which a model learns to generalize long after it has fit the training data. We show, both analytically and numerically, that grokking can surprisingly occur in linear networks performing linear tasks in a simple teacher-student setup with Gaussian inputs. In this setting, the full training dynamics are derived in terms of the training and generalization data covariance matrices. We present exact predictions on how the grokking time depends on input and output dimensionality, training sample size, regularization, and network initialization. We demonstrate that the sharp increase in generalization accuracy may not imply a transition from "memorization" to "understanding", but can simply be an artifact of the accuracy measure. We provide empirical verification of our calculations, along with preliminary results indicating that some of the predictions also hold for deeper networks with non-linear activations.
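A minimal numerical sketch of a teacher-student setup in this spirit is given below, assuming a single-layer linear student trained by full-batch gradient descent on the MSE loss with weight decay and a small random initialization; the threshold-based "accuracy" at the end is an illustrative measure chosen here, not necessarily the one used in the work, but it shows how a continuous decrease of the test loss can register as a sharp jump in accuracy.

```python
# Minimal sketch (assumptions: one-layer linear student, noiseless linear teacher,
# MSE loss, full-batch GD with weight decay, small Gaussian initialization).
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n, lr, wd, steps = 100, 10, 80, 0.05, 1e-3, 20_000

W_teacher = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
X = rng.standard_normal((n, d_in))
Y = X @ W_teacher.T                              # noiseless teacher labels

X_test = rng.standard_normal((5_000, d_in))
Y_test = X_test @ W_teacher.T

W = 1e-3 * rng.standard_normal((d_out, d_in))    # small initialization

for t in range(steps):
    E = X @ W.T - Y                              # training residuals
    grad = (E.T @ X) / n + wd * W                # MSE gradient plus weight decay
    W -= lr * grad
    if t % 2_000 == 0:
        pred = X_test @ W.T
        train_loss = np.mean(E ** 2)
        test_loss = np.mean((pred - Y_test) ** 2)
        # illustrative "accuracy": fraction of test outputs within 10% relative error
        acc = np.mean(np.abs(pred - Y_test) < 0.1 * np.abs(Y_test))
        print(t, train_loss, test_loss, acc)
```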
Abstract: The training of neural networks is a complex, high-dimensional, non-convex, and noisy optimization problem whose theoretical understanding is interesting both from an applied perspective and for fundamental reasons. A core challenge is to understand the geometry and topography of the landscape that guides the optimization. In this work, we employ standard statistical-mechanics methods, namely phase-space exploration using Langevin dynamics, to study this landscape for an over-parameterized fully connected network performing a classification task on random data. Analyzing the fluctuation statistics, in analogy with thermal dynamics at constant temperature, we infer a clear geometric description of the low-loss region. We find that it is a low-dimensional manifold whose dimension can be readily obtained from the fluctuations. Furthermore, this dimension is controlled by the number of data points that reside near the classification decision boundary. Importantly, we find that a quadratic approximation of the loss near the minimum is fundamentally inadequate, owing to the exponential nature of the decision boundary and the flatness of the low-loss region. This causes the dynamics to sample regions of higher curvature at higher temperatures, while producing quadratic-like statistics at any given temperature. We explain this behavior with a simplified loss model that is analytically tractable and reproduces the observed fluctuation statistics.
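The sketch below illustrates the flavor of such an experiment, under several assumptions not spelled out in the abstract: a small one-hidden-layer tanh network with a logistic loss on randomly labeled Gaussian data, Langevin dynamics implemented as full-batch gradient descent plus Gaussian noise at temperature T, and an effective dimension read off from the mean loss fluctuation via an equipartition-style relation ⟨L⟩ − L_min ≈ d_eff·T/2. All of these choices are illustrative, not the protocol of the work itself.

```python
# Minimal sketch (assumptions: one-hidden-layer tanh network, logistic loss on
# random +/-1 labels, Langevin dynamics = GD + Gaussian noise, equipartition-style
# estimate of the effective dimension of the low-loss region).
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 30, 20, 200                        # samples, input dim, hidden width (over-parameterized)
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)          # random labels

W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

def loss_and_grads(W, a):
    H = np.tanh(X @ W.T)                     # hidden activations, shape (n, m)
    f = H @ a                                # network outputs
    loss = np.mean(np.logaddexp(0.0, -y * f))
    g = -y * np.exp(-np.logaddexp(0.0, y * f))   # d(loss)/d(output) = -y / (1 + e^{y f}), stable
    grad_a = H.T @ g / n
    grad_W = ((g[:, None] * a[None, :]) * (1.0 - H ** 2)).T @ X / n
    return loss, grad_W, grad_a

# 1) Reach the low-loss region with plain gradient descent.
lr = 0.2
for _ in range(50_000):
    loss, gW, ga = loss_and_grads(W, a)
    W -= lr * gW
    a -= lr * ga
L_min, _, _ = loss_and_grads(W, a)

# 2) Sample around the low-loss region with Langevin dynamics at temperature T.
T, lr, losses = 1e-4, 0.05, []
for _ in range(50_000):
    loss, gW, ga = loss_and_grads(W, a)
    W += -lr * gW + np.sqrt(2 * lr * T) * rng.standard_normal(W.shape)
    a += -lr * ga + np.sqrt(2 * lr * T) * rng.standard_normal(a.shape)
    losses.append(loss)

# Equipartition-style estimate: <L> - L_min ~ d_eff * T / 2 (an assumed diagnostic).
d_eff = 2 * (np.mean(losses[10_000:]) - L_min) / T
print("estimated effective dimension of the low-loss region:", d_eff)
```

Repeating the sampling phase at several temperatures would show whether the estimated dimension stays constant, as a quadratic basin would predict, or drifts with T, as described in the abstract.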