Title: Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems

Oct 19, 2023

David T. Hoffmann, Simon Schrodi, Nadine Behrmann, Volker Fischer, Thomas Brox

Abstract: In this work, we study rapid, step-wise improvements of the loss in transformers when they are confronted with multi-step decision tasks. We find that transformers struggle to learn the intermediate tasks, whereas CNNs have no such issue on the tasks we studied. When transformers do learn the intermediate task, they do so rapidly and unexpectedly, after both training and validation loss have saturated for hundreds of epochs. We call these rapid improvements Eureka-moments, since the transformer appears to suddenly learn a previously incomprehensible task. Similar leaps in performance have become known as Grokking. In contrast to Grokking, for Eureka-moments both the validation and the training loss saturate before rapidly improving. We trace the problem back to the Softmax function in the self-attention block of transformers and show ways to alleviate it. These fixes improve training speed: the improved models reach 95% of the baseline model's performance in just 20% of the training steps, are much more likely to learn the intermediate task, achieve higher final accuracy, and are more robust to hyperparameters.
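For readers unfamiliar with where the Softmax sits in self-attention, the following is a minimal sketch of standard single-head scaled dot-product attention in plain Python. It is not the authors' code and does not implement their fixes, which the abstract does not describe. The function names `softmax`, `attention`, and `softmax_jacobian` are illustrative choices. The Jacobian helper shows a general, well-known property of Softmax: its derivatives shrink toward zero as the output approaches a one-hot distribution. Whether this is the exact mechanism the paper identifies is an assumption here, since the abstract only names the Softmax as the source of the problem.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Standard single-head scaled dot-product attention.

    queries, keys: lists of d-dimensional vectors (lists of floats).
    values: list of vectors, one per key.
    Returns (outputs, weights); weights[i] is the Softmax row for query i.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def softmax_jacobian(p):
    """Jacobian of softmax at output p: J[i][j] = p[i] * (delta_ij - p[j])."""
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Toy inputs (hypothetical values, chosen only for illustration).
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[10.0, 0.0], [0.0, 10.0]]
    out, w = attention(q, k, v)
    print("attention weights:", w[0])
    print("output:", out[0])

    # Compare Softmax derivatives for a flat and a sharply peaked distribution.
    flat = softmax([0.0, 0.0])
    peaked = softmax([0.0, 20.0])
    print("largest |dJ| (flat):  ", max(abs(x) for row in softmax_jacobian(flat) for x in row))
    print("largest |dJ| (peaked):", max(abs(x) for row in softmax_jacobian(peaked) for x in row))
```

Running the script prints the attention weights for the toy input and the largest Jacobian entry for a flat versus a sharply peaked Softmax output, which gives a rough feel for how little gradient can pass through a saturated attention row.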
