Abstract:Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles -- mutual alignment, unlocking and racing -- that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
Abstract:Our theoretical understanding of the inner workings of general convolutional neural networks (CNN) is limited. We here present a new stepping stone towards such understanding in the form of a theory of learning in linear CNNs. By analyzing the gradient descent equations, we discover that using convolutions leads to a mismatch between the dataset structure and the network structure. We show that linear CNNs discover the statistical structure of the dataset with non-linear, stage-like transitions, and that the speed of discovery changes depending on this structural mismatch. Moreover, we find that the mismatch lies at the heart of what we call the 'dominant frequency bias', where linear CNNs arrive at these discoveries using only the dominant frequencies of the different structural parts present in the dataset. Our findings can help explain several characteristics of general CNNs, such as their shortcut learning and their tendency to rely on texture instead of shape.