Noa Rubin

Applications of Statistical Field Theory in Deep Learning

Feb 25, 2025

Zohar Ringel, Noa Rubin, Edo Mor, Moritz Helias, Inbar Seroussi

Abstract: Deep learning algorithms have made incredible strides in the past decade, yet due to the complexity of these algorithms, the science of deep learning remains in its early stages. Because deep learning is an experimentally driven field, it is natural to seek a theory of it within the physics paradigm. As deep learning is largely about learning functions and distributions over functions, statistical field theory, a rich and versatile toolbox for tackling complex distributions over functions (fields), is an obvious choice of formalism. Research efforts carried out in the past few years have demonstrated the ability of field theory to provide useful insights on generalization, implicit bias, and feature learning effects. Here we provide a pedagogical review of this emerging line of research.

From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning

Feb 05, 2025

Noa Rubin, Kirsten Fischer, Javed Lindner, David Dahmen, Inbar Seroussi, Zohar Ringel, Michael Krämer, Moritz Helias

Abstract: Theoretically describing feature learning in neural networks is crucial for understanding their expressive power and inductive biases, motivating various approaches. Some approaches describe network behavior after training through a simple change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving complex directional changes to the kernel. While these approaches capture different facets of network behavior, their relationship and respective strengths across scaling regimes remain an open question. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these approaches. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output of a linear network. However, even in this case, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.
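As a rough illustration of what "a simple change in kernel scale" means for a Gaussian-process mean predictor, the sketch below uses standard GP regression, not the paper's derivation. It checks numerically that multiplying the kernel by a factor is equivalent, for the posterior mean, to dividing the noise variance by that factor. The RBF kernel, the toy data, and the values of `scale` and `noise_var` are all assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_mean(k_train, k_test_train, y, noise_var):
    """GP regression posterior mean: K_* (K + noise_var * I)^{-1} y."""
    n = k_train.shape[0]
    return k_test_train @ np.linalg.solve(k_train + noise_var * np.eye(n), y)

# Toy data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(20, 3))
x_test = rng.normal(size=(5, 3))
y_train = np.sin(x_train.sum(axis=1))

k = rbf_kernel(x_train, x_train)
k_star = rbf_kernel(x_test, x_train)

scale = 2.5      # hypothetical kernel rescaling factor
noise_var = 0.1  # hypothetical observation-noise variance

# Rescaling the kernel by `scale` at fixed noise ...
mean_rescaled = gp_mean(scale * k, scale * k_star, y_train, noise_var)
# ... gives the same mean as the original kernel with noise_var / scale.
mean_equiv = gp_mean(k, k_star, y_train, noise_var / scale)

print(np.allclose(mean_rescaled, mean_equiv))
```

A pure rescaling therefore cannot change which directions in function space the predictor favors, only how strongly it regularizes. That is the sense in which it differs from the directional kernel adaptation the abstract describes.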

* 24 pages, 6 figures

Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks

Oct 05, 2023

Noa Rubin, Inbar Seroussi, Zohar Ringel

Abstract: A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a phenomenon beyond lazy learning and the Gaussian Process (GP) limit, one that involves feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN generates useful internal representations of the teacher that are sharply distinct from those before the transition.
