Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Daniel Kunin

Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

Jun 06, 2025

Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane

Abstract:What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each round, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.

* 35 pages, 7 figures

Via

Access Paper or Ask Questions

From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Sep 22, 2024

Clémentine C. J. Dominé, Nicolas Anguita, Alexandra M. Proca, Lukas Braun, Daniel Kunin, Pedro A. M. Mediano, Andrew M. Saxe

Figure 1 for From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Figure 2 for From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Figure 3 for From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Figure 4 for From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Abstract:Biological and artificial neural networks develop internal representations that enable them to perform complex tasks. In artificial networks, the effectiveness of these models relies on their ability to build task specific representation, a process influenced by interactions among datasets, architectures, initialization strategies, and optimization algorithms. Prior studies highlight that different initializations can place networks in either a lazy regime, where representations remain static, or a rich/feature learning regime, where representations evolve dynamically. Here, we examine how initialization influences learning dynamics in deep linear neural networks, deriving exact solutions for lambda-balanced initializations-defined by the relative scale of weights across layers. These solutions capture the evolution of representations and the Neural Tangent Kernel across the spectrum from the rich to the lazy regimes. Our findings deepen the theoretical understanding of the impact of weight initialization on learning regimes, with implications for continual learning, reversal learning, and transfer learning, relevant to both neuroscience and practical applications.

* 10 pages, 8 figures

Via

Access Paper or Ask Questions

Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

Jun 10, 2024

Daniel Kunin, Allan Raventós, Clémentine Dominé, Feng Chen, David Klindt, Andrew Saxe, Surya Ganguli

Figure 1 for Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

Figure 2 for Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

Figure 3 for Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

Figure 4 for Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

Abstract:While the impressive performance of modern neural networks is often attributed to their capacity to efficiently extract task-relevant features from data, the mechanisms underlying this rich feature learning regime remain elusive, with much of our theoretical understanding stemming from the opposing lazy regime. In this work, we derive exact solutions to a minimal model that transitions between lazy and rich learning, precisely elucidating how unbalanced layer-specific initialization variances and learning rates determine the degree of feature learning. Our analysis reveals that they conspire to influence the learning regime through a set of conserved quantities that constrain and modify the geometry of learning trajectories in parameter and function space. We extend our analysis to more complex linear models with multiple neurons, outputs, and layers and to shallow nonlinear networks with piecewise linear activation functions. In linear networks, rapid feature learning only occurs with balanced initializations, where all layers learn at similar speeds. While in nonlinear networks, unbalanced initializations that promote faster learning in earlier layers can accelerate rich learning. Through a series of experiments, we provide evidence that this unbalanced rich regime drives feature learning in deep finite-width networks, promotes interpretability of early layers in CNNs, reduces the sample complexity of learning hierarchical data, and decreases the time to grokking in modular arithmetic. Our theory motivates further exploration of unbalanced initializations to enhance efficient feature learning.

* 40 pages, 12 figures

Via

Access Paper or Ask Questions

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Jun 07, 2023

Feng Chen, Daniel Kunin, Atsushi Yamamura, Surya Ganguli

Abstract:In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.

* 30 pages, 10 figures

Via

Access Paper or Ask Questions

The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks

Oct 07, 2022

Daniel Kunin, Atsushi Yamamura, Chao Ma, Surya Ganguli

Figure 1 for The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks

Figure 2 for The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks

Figure 3 for The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks

Abstract:In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss and past a point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, while structured enough to enable geometric analysis of its gradient dynamics. Using this analysis, we generalize the existing results of maximum-margin bias for homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. On the other hand, we conjecture that this norm-minimization discards, when possible, unnecessary higher-order parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.

* 33 pages, 5 figures

Via

Access Paper or Ask Questions

Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

Jul 19, 2021

Daniel Kunin, Javier Sagastuy-Brena, Lauren Gillespie, Eshed Margalit, Hidenori Tanaka, Surya Ganguli, Daniel L. K. Yamins

Figure 1 for Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

Figure 2 for Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

Figure 3 for Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

Figure 4 for Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

Abstract:In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). We find empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.

* 30 pages, 8 figures

Via

Access Paper or Ask Questions

Noether's Learning Dynamics: The Role of Kinetic Symmetry Breaking in Deep Learning

May 06, 2021

Hidenori Tanaka, Daniel Kunin

Figure 1 for Noether's Learning Dynamics: The Role of Kinetic Symmetry Breaking in Deep Learning

Figure 2 for Noether's Learning Dynamics: The Role of Kinetic Symmetry Breaking in Deep Learning

Abstract:In nature, symmetry governs regularities, while symmetry breaking brings texture. Here, we reveal a novel role of symmetry breaking behind efficiency and stability in learning, a critical issue in machine learning. Recent experiments suggest that the symmetry of the loss function is closely related to the learning performance. This raises a fundamental question. Is such symmetry beneficial, harmful, or irrelevant to the success of learning? Here, we demystify this question and pose symmetry breaking as a new design principle by considering the symmetry of the learning rule in addition to the loss function. We model the discrete learning dynamics using a continuous-time Lagrangian formulation, in which the learning rule corresponds to the kinetic energy and the loss function corresponds to the potential energy. We identify kinetic asymmetry unique to learning systems, where the kinetic energy often does not have the same symmetry as the potential (loss) function reflecting the non-physical symmetries of the loss function and the non-Euclidean metric used in learning rules. We generalize Noether's theorem known in physics to explicitly take into account this kinetic asymmetry and derive the resulting motion of the Noether charge. Finally, we apply our theory to modern deep networks with normalization layers and reveal a mechanism of implicit adaptive optimization induced by the kinetic symmetry breaking.

Via

Access Paper or Ask Questions

Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Dec 08, 2020

Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel L. K. Yamins, Hidenori Tanaka

Figure 1 for Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Figure 2 for Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Figure 3 for Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Figure 4 for Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Abstract:Predicting the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that finite learning rates used in practice can actually break these symmetry induced conservation laws. We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic predictions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state of the art architectures trained on any dataset.

* 28 pages, 17 figures

Via

Access Paper or Ask Questions

Pruning neural networks without any data by iteratively conserving synaptic flow

Jun 09, 2020

Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli

Figure 1 for Pruning neural networks without any data by iteratively conserving synaptic flow

Figure 2 for Pruning neural networks without any data by iteratively conserving synaptic flow

Figure 3 for Pruning neural networks without any data by iteratively conserving synaptic flow

Figure 4 for Pruning neural networks without any data by iteratively conserving synaptic flow

Abstract:Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy both during training and at test time. Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design. We first mathematically formulate and experimentally verify a conservation law that explains why existing gradient-based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse can be entirely avoided, motivating a novel pruning algorithm Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the network at initialization subject to a sparsity constraint. Notably, this algorithm makes no reference to the training data and consistently outperforms existing state-of-the-art pruning algorithms at initialization over a range of models (VGG and ResNet), datasets (CIFAR-10/100 and Tiny ImageNet), and sparsity constraints (up to 99.9 percent). Thus our data-agnostic pruning algorithm challenges the existing paradigm that data must be used to quantify which synapses are important.

Via

Access Paper or Ask Questions

Two Routes to Scalable Credit Assignment without Weight Symmetry

Feb 28, 2020

Daniel Kunin, Aran Nayebi, Javier Sagastuy-Brena, Surya Ganguli, Jon Bloom, Daniel L. K. Yamins

Figure 1 for Two Routes to Scalable Credit Assignment without Weight Symmetry

Figure 2 for Two Routes to Scalable Credit Assignment without Weight Symmetry

Figure 3 for Two Routes to Scalable Credit Assignment without Weight Symmetry

Figure 4 for Two Routes to Scalable Credit Assignment without Weight Symmetry

Abstract:The neural plausibility of backpropagation has long been disputed, primarily for its use of non-local weight transport - the biologically dubious requirement that one neuron instantaneously measure the synaptic weights of another. Until recently, attempts to create local learning rules that avoid weight transport have typically failed in the large-scale learning scenarios where backpropagation shines, e.g. ImageNet categorization with deep convolutional networks. Here, we investigate a recently proposed local learning rule that yields competitive performance with backpropagation and find that it is highly sensitive to metaparameter choices, requiring laborious tuning that does not transfer across network architecture. Our analysis indicates the underlying mathematical reason for this instability, allowing us to identify a more robust local learning rule that better transfers without metaparameter tuning. Nonetheless, we find a performance and stability gap between this local rule and backpropagation that widens with increasing model depth. We then investigate several non-local learning rules that relax the need for instantaneous weight transport into a more biologically-plausible "weight estimation" process, showing that these rules match state-of-the-art performance on deep networks and operate effectively in the presence of noisy updates. Taken together, our results suggest two routes towards the discovery of neural implementations for credit assignment without weight symmetry: further improvement of local rules so that they perform consistently across architectures and the identification of biological implementations for non-local learning mechanisms.

* 19 pages including supplementary information, 10 figures

Via

Access Paper or Ask Questions