Title: Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

Oct 12, 2023

Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou

Abstract: By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P, that extends $\mu$P, and we show empirically that it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (as in modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as a Megatron transformer trained on Common Crawl.
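
To make the notion of a depthwise block multiplier concrete, the following is a minimal NumPy sketch of a resnet with one layer per block, evaluated at initialization only. The abstract does not state the scaling exponent; the default `alpha=0.5` (a $1/\sqrt{L}$ multiplier for depth $L$) is an assumption based on how Depth-$\mu$P is usually described and should be checked against the paper. The block form `h + multiplier * W @ phi(h)`, the function names, and the constant `a` are likewise illustrative choices, not the authors' code, and the learning-rate side of the parametrization is not modeled.

```python
import numpy as np


def init_weights(depth, width, rng):
    """One (width x width) Gaussian matrix per block, variance 1/width."""
    return [rng.standard_normal((width, width)) / np.sqrt(width)
            for _ in range(depth)]


def resnet_forward(x, weights, a=1.0, alpha=0.5, phi=np.abs):
    """Forward pass of a one-layer-per-block resnet.

    Each block computes h <- h + (a / L**alpha) * W @ phi(h), where L is
    the number of blocks. alpha=0.5 is the assumed Depth-muP exponent;
    alpha=0.0 corresponds to an unscaled residual branch.
    """
    depth = len(weights)
    multiplier = a / depth ** alpha
    h = x
    for W in weights:
        h = h + multiplier * (W @ phi(h))
    return h


def rms(v):
    """Root-mean-square size of the coordinates of v."""
    return float(np.sqrt(np.mean(v ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    width = 256
    x = rng.standard_normal(width)
    for depth in (4, 16, 64):
        weights = init_weights(depth, width, rng)
        scaled = rms(resnet_forward(x, weights, alpha=0.5))
        unscaled = rms(resnet_forward(x, weights, alpha=0.0))
        print(f"depth={depth:3d}  alpha=0.5: {scaled:.3g}  alpha=0.0: {unscaled:.3g}")
```

In this sketch the output size stays of order one as depth grows when `alpha=0.5`, while the unscaled branch (`alpha=0.0`) grows roughly exponentially with depth. The absolute-value nonlinearity is the default for `phi` only because the abstract singles it out; any other homogeneous nonlinearity can be passed in its place.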
