Title: JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention

Oct 03, 2023

Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Du

Abstract: We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions in previous analyses (e.g., lack of residual connections) and predicts that, in the presence of nonlinear activations, attention first becomes sparse (to learn salient tokens) and then dense (to learn less salient tokens). In the linear case, it is consistent with existing works that show attention becomes sparse over time. We leverage JoMA to qualitatively explain how tokens are combined to form hierarchies in multilayer Transformers when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained on real-world datasets (Wikitext2/Wikitext103) and on various pre-trained models (OPT, Pythia) verify our theoretical findings.
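The abstract's "sparse, then dense" prediction refers to how concentrated an attention distribution is. One common way to quantify this, shown below as an illustrative sketch, is the Shannon entropy of the attention weights: low entropy means sparse (peaked) attention, high entropy means dense (spread-out) attention. This is not taken from the paper; the function names `softmax` and `attention_entropy` and the toy score vectors are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    Lower entropy = sparser attention; the maximum, log(n), is reached
    when attention is uniform over n tokens.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Hypothetical toy scores over four tokens (not data from the paper).
uniform = softmax([0.0, 0.0, 0.0, 0.0])  # dense: all tokens weighted equally
peaked = softmax([8.0, 0.0, 0.0, 0.0])   # sparse: one token dominates

print("dense entropy: ", attention_entropy(uniform))
print("sparse entropy:", attention_entropy(peaked))
```

Tracking a quantity like this per head over training steps would be one possible way to look for the sparse-to-dense trend; the paper may use a different measure.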
