Michael R. DeWeese

Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

Jun 06, 2025

Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane

Abstract: What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each round, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.

* 35 pages, 7 figures
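
The staircase-like loss curve mentioned in the abstract can be reproduced in a toy setting. The sketch below is not AGF itself: it is plain gradient descent on a two-layer diagonal linear network, assuming whitened inputs so that the loss reduces to L = 0.5 * sum_i (u_i * v_i - beta_i)^2. The target coefficients `beta`, the initialization scale `alpha`, and the learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def train_diagonal_network(beta, alpha=1e-3, lr=0.01, steps=4000):
    """Gradient descent on L = 0.5 * sum_i (u_i * v_i - beta_i)^2 from small init."""
    u = np.full_like(beta, alpha, dtype=float)
    v = np.full_like(beta, alpha, dtype=float)
    losses, products = [], []
    for _ in range(steps):
        residual = u * v - beta
        losses.append(0.5 * np.sum(residual ** 2))
        products.append(u * v)
        # Simultaneous update of both layers.
        u, v = u - lr * residual * v, v - lr * residual * u
    return np.array(losses), np.array(products)

beta = np.array([4.0, 2.0, 1.0])  # hypothetical target coefficients
losses, products = train_diagonal_network(beta)

# First step at which each coordinate reaches half of its target value.
half_times = [int(np.argmax(products[:, i] >= 0.5 * beta[i])) for i in range(len(beta))]
print(half_times)
```

In this toy problem each coordinate stays near zero for a while and then grows quickly, so the loss drops in separate steps, with larger target coefficients picked up earlier. Plotting `losses` against the step index shows the plateaus.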

Neural Tangent Kernel Eigenvalues Accurately Predict Generalization

Oct 13, 2021

James B. Simon, Madeline Dickens, Michael R. DeWeese

Abstract: Finding a quantitative theory of neural network generalization has long been a central goal of deep learning research. We extend recent results to demonstrate that, by examining the eigensystem of a neural network's "neural tangent kernel", one can predict its generalization performance when learning arbitrary functions. Our theory accurately predicts not only test mean-squared-error but all first- and second-order statistics of the network's learned function. Furthermore, using a measure quantifying the "learnability" of a given target function, we prove a new "no-free-lunch" theorem characterizing a fundamental tradeoff in the inductive bias of wide neural networks: improving a network's generalization for a given target function must worsen its generalization for orthogonal functions. We further demonstrate the utility of our theory by analytically predicting two surprising phenomena - worse-than-chance generalization on hard-to-learn functions and nonmonotonic error curves in the small data regime - which we subsequently observe in experiments. Though our theory is derived for infinite-width architectures, we find it agrees with networks as narrow as width 20, suggesting it is predictive of generalization in practical neural networks. Code replicating our results is available at https://github.com/james-simon/eigenlearning.

* 10 pages (main text), 24 pages (total), 10 figures
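
The trade-off described in the abstract can be illustrated numerically for ridgeless kernel regression on a small discrete domain. The abstract does not define "learnability", so the sketch assumes the normalized inner product <f, f_hat> / <f, f> between a target f and the function f_hat learned from n samples. Under that assumption, the learnabilities of any orthonormal basis of targets sum to n, because the sum equals the trace of the linear map from targets to predictions. The Gaussian kernel, domain size, and training-set size are illustrative choices and are not taken from the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 30, 8                      # domain size and training-set size (illustrative)
x = np.arange(M, dtype=float)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # Gaussian kernel on a grid
train = rng.choice(M, size=n, replace=False)

# Ridgeless kernel regression as a linear map T from target values to predictions:
# f_hat = K[:, train] @ inv(K[train, train]) @ f[train].
T = np.zeros((M, M))
T[:, train] = np.linalg.solve(K[np.ix_(train, train)], K[train, :]).T

def learnability(f):
    """Assumed definition: <f, f_hat> / <f, f>."""
    return f @ (T @ f) / (f @ f)

# Any orthonormal basis of functions on the domain works; take a random one.
basis, _ = np.linalg.qr(rng.standard_normal((M, M)))
total = sum(learnability(basis[:, i]) for i in range(M))
print(total)  # close to n, up to floating-point error
```

Because the total is fixed by n, changing the kernel to raise the learnability of one target has to lower it for some target orthogonal to it.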

On the Power of Shallow Learning

Jun 06, 2021

James B. Simon, Sajant Anand, Michael R. DeWeese

Abstract: A deluge of recent work has explored equivalences between wide neural networks and kernel methods. A central theme is that one can analytically find the kernel corresponding to a given wide network architecture, but despite major implications for architecture design, no work to date has asked the converse question: given a kernel, can one find a network that realizes it? We affirmatively answer this question for fully-connected architectures, completely characterizing the space of achievable kernels. Furthermore, we give a surprising constructive proof that any kernel of any wide, deep, fully-connected net can also be achieved with a network with just one hidden layer and a specially-designed pointwise activation function. We experimentally verify our construction and demonstrate that, by just choosing the activation function, we can design a wide shallow network that mimics the generalization performance of any wide, deep, fully-connected network.

* 13 pages, 5 figures
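
As a concrete instance of the forward direction mentioned in the abstract (network to kernel), the sketch below compares the closed-form kernel of an infinitely wide one-hidden-layer ReLU layer with Gaussian weights (the first-order arc-cosine kernel) against a Monte Carlo estimate from a finite random layer. It does not implement the paper's converse construction of an activation function for a given kernel. The inputs, the width, and the function names are illustrative assumptions.

```python
import numpy as np

def relu_kernel_closed_form(x, y):
    """E_w[relu(w.x) * relu(w.y)] for w ~ N(0, I) (first-order arc-cosine kernel)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def relu_kernel_monte_carlo(x, y, width=200_000, seed=0):
    """Empirical kernel of a random one-hidden-layer ReLU layer of the given width."""
    w = np.random.default_rng(seed).standard_normal((width, x.size))
    return np.mean(np.maximum(w @ x, 0.0) * np.maximum(w @ y, 0.0))

x = np.array([1.0, 0.5, -0.2])
y = np.array([0.3, -1.0, 0.8])
print(relu_kernel_closed_form(x, y), relu_kernel_monte_carlo(x, y))
```

The two values agree up to sampling noise that shrinks as the width grows, which is the sense in which a wide network "has" a kernel.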

A new method for parameter estimation in probabilistic models: Minimum probability flow

Jul 17, 2020

Jascha Sohl-Dickstein, Peter Battaglino, Michael R. DeWeese

Abstract: Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function. We propose a new parameter fitting method, Minimum Probability Flow (MPF), which is applicable to any parametric model. We demonstrate parameter estimation using MPF in two cases: a continuous state space model, and an Ising spin glass. In the latter case it outperforms current techniques by at least an order of magnitude in convergence time with lower error in the recovered coupling parameters.

* Originally published 2011. Uploaded to arXiv 2020. arXiv admin note: text overlap with arXiv:0906.4779, arXiv:1205.4295
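
A minimal sketch of an MPF-style objective for a small Ising model follows. It rests on several assumptions not stated in the abstract: the energy is E(x) = -0.5 * x^T J x with spins in {-1, +1}, the neighbours of a state are all of its single-spin flips, and the sum runs over every neighbour rather than only over states absent from the data. The exact model distribution stands in for an infinite data set, so no sampling is involved, and the partition function of the fitted model is never needed.

```python
import itertools
import numpy as np

def energy(states, J):
    """Ising energy E(x) = -0.5 * x^T J x for rows x in {-1, +1}^d."""
    return -0.5 * np.einsum("si,ij,sj->s", states, J, states)

def mpf_objective(states, weights, J):
    """Sum over data states x and single-spin-flip neighbours x' of
    weights(x) * exp(0.5 * (E(x) - E(x')))."""
    e_data = energy(states, J)
    total = 0.0
    for k in range(states.shape[1]):
        flipped = states.copy()
        flipped[:, k] *= -1
        total += np.sum(weights * np.exp(0.5 * (e_data - energy(flipped, J))))
    return total

d = 4
rng = np.random.default_rng(0)
J_true = rng.standard_normal((d, d))
J_true = 0.5 * (J_true + J_true.T)      # symmetric couplings (hypothetical values)
np.fill_diagonal(J_true, 0.0)

# Exact model distribution over all 2^d states stands in for an infinite data set.
states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
p = np.exp(-energy(states, J_true))
p /= p.sum()

k_true = mpf_objective(states, p, J_true)
k_zero = mpf_objective(states, p, np.zeros((d, d)))
print(k_true, k_zero)
```

With this choice of neighbours the objective is convex in J and its gradient vanishes at the generating couplings, so `k_true` is smaller than the value at any other coupling matrix, including `k_zero`. Fitting would minimize `mpf_objective` over J with any gradient-based optimizer.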

Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses

Mar 23, 2020

Charles G. Frye, James Simon, Neha S. Wadia, Andrew Ligeralde, Michael R. DeWeese, Kristofer E. Bouchard

Abstract: Despite the fact that the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. One thread of work has focused on explaining this phenomenon by characterizing the local curvature near critical points of the loss function, where the gradients are near zero, and demonstrating that neural network losses enjoy a no-bad-local-minima property and an abundance of saddle points. We report here that the methods used to find these putative critical points suffer from a bad local minima problem of their own: they often converge to or pass through regions where the gradient norm has a stationary point. We call these gradient-flat regions, since they arise when the gradient is approximately in the kernel of the Hessian, such that the loss is locally approximately linear, or flat, in the direction of the gradient. We describe how the presence of these regions necessitates care in both interpreting past results that claimed to find critical points of neural network losses and in designing second-order methods for optimizing neural networks.

* 18 pages, 5 figures
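
A two-variable toy function makes the notion of a gradient-flat point concrete. The function f(x, y) = x + x^3 + y^2 is a hypothetical example chosen for illustration, not one from the paper. It has no critical points at all, since its x-derivative 1 + 3x^2 is always positive. Minimizing the squared gradient norm, one simple way of searching for critical points, nevertheless converges: it stops at the origin, where the gradient is nonzero but lies in the kernel of the Hessian.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x + x^3 + y^2."""
    x, y = p
    return np.array([1.0 + 3.0 * x ** 2, 2.0 * y])

def hess(p):
    """Hessian of f(x, y) = x + x^3 + y^2."""
    x, _ = p
    return np.array([[6.0 * x, 0.0], [0.0, 2.0]])

p = np.array([0.5, 0.5])
for _ in range(500):
    # The gradient of 0.5 * ||grad f||^2 is (Hessian @ gradient).
    p = p - 0.05 * hess(p) @ grad(p)

grad_norm = np.linalg.norm(grad(p))
flatness = np.linalg.norm(hess(p) @ grad(p))
print(p, grad_norm, flatness)
```

The search ends with `flatness` near zero while `grad_norm` stays near 1, so a stopping rule based only on convergence of the search would wrongly report a critical point.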

Design of optical neural networks with component imprecisions

Dec 13, 2019

Michael Y.-S. Fang, Sasikanth Manipatruni, Casimir Wierzynski, Amir Khosrowshahi, Michael R. DeWeese

Abstract: For the benefit of designing scalable, fault-resistant optical neural networks (ONNs), we investigate the effects architectural designs have on the ONNs' robustness to imprecise components. We train two ONNs -- one with a more tunable design (GridNet) and one with better fault tolerance (FFTNet) -- to classify handwritten digits. When simulated without any imperfections, GridNet yields a better accuracy (~98%) than FFTNet (~95%). However, under a small amount of error in their photonic components, the more fault-tolerant FFTNet overtakes GridNet. We further provide thorough quantitative and qualitative analyses of ONNs' sensitivity to varying levels and types of imprecisions. Our results offer guidelines for the principled design of fault-tolerant ONNs as well as a foundation for further research.

* Optics Express 27.10 (2019): 14009-14029

Numerically Recovering the Critical Points of a Deep Linear Autoencoder

Jan 29, 2019

Charles G. Frye, Neha S. Wadia, Michael R. DeWeese, Kristofer E. Bouchard

Abstract: Numerically locating the critical points of non-convex surfaces is a long-standing problem central to many fields. Recently, the loss surfaces of deep neural networks have been explored to gain insight into outstanding questions in optimization, generalization, and network architecture design. However, the degree to which recently-proposed methods for numerically recovering critical points actually do so has not been thoroughly evaluated. In this paper, we examine this issue in a case for which the ground truth is known: the deep linear autoencoder. We investigate two sub-problems associated with numerical critical point identification: first, because of large parameter counts, it is infeasible to find all of the critical points for contemporary neural networks, necessitating sampling approaches whose characteristics are poorly understood; second, the numerical tolerance for accurately identifying a critical point is unknown, and conservative tolerances are difficult to satisfy. We first identify connections between recently-proposed methods and well-understood methods in other fields, including chemical physics, economics, and algebraic geometry. We find that several methods work well at recovering certain information about loss surfaces, but fail to take an unbiased sample of critical points. Furthermore, numerical tolerance must be very strict to ensure that numerically-identified critical points have similar properties to true analytical critical points. We also identify a recently-published Newton method for optimization that outperforms previous methods as a critical point-finding algorithm. We expect our results will guide future attempts to numerically study critical points in large nonlinear neural networks.

Hamiltonian Monte Carlo Without Detailed Balance

Mar 25, 2016

Jascha Sohl-Dickstein, Mayur Mudigonda, Michael R. DeWeese

Abstract: We present a method for performing Hamiltonian Monte Carlo that largely eliminates sample rejection for typical hyperparameters. In situations that would normally lead to rejection, instead a longer trajectory is computed until a new state is reached that can be accepted. This is achieved using Markov chain transitions that satisfy the fixed point equation, but do not satisfy detailed balance. The resulting algorithm significantly suppresses the random walk behavior and wasted function evaluations that are typically the consequence of update rejection. We demonstrate a greater than factor of two improvement in mixing time on three test problems. We release the source code as Python and MATLAB packages.

* Accepted conference submission to ICML 2014 and also featured in a special edition of JMLR. Since updated to include additional literature citations
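
For context, the sketch below shows the standard HMC update that the abstract takes as its starting point: a leapfrog trajectory followed by a Metropolis accept/reject test. It is the baseline, not the paper's method and not the released packages. The rejection branch, where the chain stays put and the gradient evaluations are wasted, is the step the paper replaces with a longer trajectory. The target (a 2-D standard normal), step size, and trajectory length are illustrative assumptions.

```python
import numpy as np

def hmc_step(x, neg_log_prob, grad_neg_log_prob, rng, step=0.3, n_leapfrog=5):
    """One standard HMC update with a Metropolis accept/reject test."""
    p0 = rng.standard_normal(x.shape)
    x_new = x.copy()
    p = p0 - 0.5 * step * grad_neg_log_prob(x_new)       # initial half step in momentum
    for i in range(n_leapfrog):
        x_new = x_new + step * p                          # full step in position
        scale = 1.0 if i < n_leapfrog - 1 else 0.5        # final momentum step is a half step
        p = p - scale * step * grad_neg_log_prob(x_new)
    h_old = neg_log_prob(x) + 0.5 * p0 @ p0
    h_new = neg_log_prob(x_new) + 0.5 * p @ p
    accepted = bool(rng.random() < np.exp(min(0.0, h_old - h_new)))
    # On rejection the chain repeats the old state and the trajectory is discarded.
    return (x_new, True) if accepted else (x, False)

rng = np.random.default_rng(0)
x = np.zeros(2)
samples, n_accepted = [], 0
for _ in range(3000):
    x, accepted = hmc_step(x, lambda z: 0.5 * z @ z, lambda z: z, rng)
    samples.append(x)
    n_accepted += accepted
samples = np.array(samples)
accept_rate = n_accepted / len(samples)
print(accept_rate, samples.mean(axis=0), samples.var(axis=0))
```

On harder targets or with larger step sizes the acceptance rate falls, and each rejection costs a full trajectory of gradient evaluations without moving the chain.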

A Markov Jump Process for More Efficient Hamiltonian Monte Carlo

Oct 11, 2015

Andrew B. Berger, Mayur Mudigonda, Michael R. DeWeese, Jascha Sohl-Dickstein

Abstract: In most sampling algorithms, including Hamiltonian Monte Carlo, transition rates between states correspond to the probability of making a transition in a single time step, and are constrained to be less than or equal to 1. We derive a Hamiltonian Monte Carlo algorithm using a continuous time Markov jump process, and are thus able to escape this constraint. Transition rates in a Markov jump process need only be non-negative. We demonstrate that the new algorithm leads to improved mixing for several example problems, both by evaluating the spectral gap of the Markov operator, and by computing autocorrelation as a function of compute time. We release the algorithm as an open source Python package.

Time Resolution Dependence of Information Measures for Spiking Neurons: Atoms, Scaling, and Universality

Apr 18, 2015

Sarah E. Marzen, Michael R. DeWeese, James P. Crutchfield

Abstract: The mutual information between stimulus and spike-train response is commonly used to monitor neural coding efficiency, but neuronal computation broadly conceived requires more refined and targeted information measures of input-output joint processes. A first step towards that larger goal is to develop information measures for individual output processes, including information generation (entropy rate), stored information (statistical complexity), predictable information (excess entropy), and active information accumulation (bound information rate). We calculate these for spike trains generated by a variety of noise-driven integrate-and-fire neurons as a function of time resolution and for alternating renewal processes. We show that their time-resolution dependence reveals coarse-grained structural properties of interspike interval statistics; e.g., $\tau$-entropy rates that diverge less quickly than the firing rate indicate interspike interval correlations. We also find evidence that the excess entropy and regularized statistical complexity of different types of integrate-and-fire neurons are universal in the continuous-time limit in the sense that they do not depend on mechanism details. This suggests a surprising simplicity in the spike trains generated by these model neurons. Interestingly, neurons with gamma-distributed ISIs and neurons whose spike trains are alternating renewal processes do not fall into the same universality class. These results lead to two conclusions. First, the dependence of information measures on time resolution reveals mechanistic details about spike train generation. Second, information measures can be used as model selection tools for analyzing spike train processes.

* 20 pages, 6 figures; http://csc.ucdavis.edu/~cmg/compmech/pubs/trdctim.htm
