Title: Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

Dec 28, 2020

Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, Krzysztof Geras

Abstract: The early phase of training has been shown to be important in two ways for deep neural networks. First, the degree of regularization in this phase significantly impacts the final generalization. Second, it is accompanied by a rapid change in the local loss curvature influenced by regularization choices. Connecting these two findings, we show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the beginning of training. We argue it is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We further show that the early value of the trace of the FIM correlates strongly with the final generalization. We highlight that in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training, to which we refer as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that 1) it limits memorization by reducing the learning speed of examples with noisy labels more than that of the clean examples, and 2) trajectories with a low initial trace of the FIM end in flat minima, which are commonly associated with good generalization.
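To make the quantity in the abstract concrete, here is a minimal sketch of the trace of the FIM for a toy linear softmax classifier. It is an illustration under stated assumptions, not the authors' implementation: the model (logits = W x), the function names, and the exact enumeration over classes are all chosen here for clarity. The sketch uses the standard definition Tr(F) = E_x E_{y ~ p(y|x)} ||grad log p(y|x)||^2, where the label y is drawn from the model's own predictive distribution rather than from the dataset.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(W, x):
    """Logits of a hypothetical linear classifier: one row of W per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fim_trace(W, inputs):
    """Trace of the FIM with respect to W, averaged over the given inputs.

    For logits = W x, the gradient of log p(y|x) with respect to W is the
    outer product (e_y - p) x^T, so its squared norm is
    ||e_y - p||^2 * ||x||^2. The expectation over y ~ p(y|x) is computed
    exactly by enumerating the classes (feasible only for small models).
    """
    total = 0.0
    for x in inputs:
        p = softmax(logits_for(W, x))
        x_sq = sum(v * v for v in x)
        for y, p_y in enumerate(p):
            err_sq = sum(
                ((1.0 if k == y else 0.0) - p_k) ** 2
                for k, p_k in enumerate(p)
            )
            total += p_y * err_sq * x_sq
    return total / len(inputs)
```

An explicit penalty in the spirit of the abstract would add this value, scaled by a coefficient (a hypothetical hyperparameter here), to the training loss. For deep networks, exact enumeration over classes is usually replaced by sampling labels from the model's predictions and squaring the norm of a mini-batch gradient; that variant is not shown.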

* The last two authors contributed equally

View paper on OpenReview
