Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

Jun 07, 2022

Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, Aurelien Lucchi

Figure 1 for Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

Figure 2 for Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

Figure 3 for Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

Figure 4 for Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

Share this with someone who'll enjoy it:

Abstract:Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has been recently shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of if and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and the effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization.

View paper on

OpenReview

Share this with someone who'll enjoy it:

Title:Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

Paper and Code