Title: Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

Dec 01, 2023

Tam Nguyen, Tan M. Nguyen, Richard G. Baraniuk

Abstract: Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model degrades because of over-smoothing, in which the token representations become identical as the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional that promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens, in order to preserve the fidelity of the tokens. By minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
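The effect described in the abstract can be illustrated with a toy experiment. The sketch below is not the authors' implementation: it assumes identity query/key/value projections, no residual connections, and a fidelity correction of the form `lam * (reference - tokens)`, where `reference` is the layer-0 input. The names `attention_layer`, `run_stack`, `mean_pairwise_cosine`, and `lam` are hypothetical and chosen for illustration only. Over-smoothing is measured here as the mean pairwise cosine similarity between tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_layer(tokens, reference, lam):
    """One simplified attention layer with an optional fidelity term.

    Assumptions (not taken from the source): identity Q/K/V projections,
    no residual connection, and a correction lam * (reference - tokens).
    With lam = 0 this reduces to plain softmax self-attention.
    """
    dim = len(tokens[0])
    out = []
    for i, query in enumerate(tokens):
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in tokens])
        row = []
        for c in range(dim):
            # Smoothed value: attention-weighted average over all tokens.
            smooth = sum(w * tok[c] for w, tok in zip(weights, tokens))
            # Fidelity term pulls the output back toward the reference token.
            row.append(smooth + lam * (reference[i][c] - tokens[i][c]))
        out.append(row)
    return out


def run_stack(tokens, depth, lam):
    """Apply `depth` layers, using the layer-0 tokens as the reference."""
    reference = tokens
    for _ in range(depth):
        tokens = attention_layer(tokens, reference, lam)
    return tokens


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct token pairs."""
    sims = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            norm = math.sqrt(dot(tokens[i], tokens[i]) * dot(tokens[j], tokens[j]))
            sims.append(dot(tokens[i], tokens[j]) / norm)
    return sum(sims) / len(sims)


# Two orthogonal tokens, so the initial cosine similarity is 0.
inputs = [[1.0, 0.0], [0.0, 1.0]]
plain = mean_pairwise_cosine(run_stack(inputs, depth=6, lam=0.0))
regularized = mean_pairwise_cosine(run_stack(inputs, depth=6, lam=0.5))
print(f"plain: {plain:.4f}  regularized: {regularized:.4f}")
```

In this toy setting the plain stack drives the two tokens to near-identical vectors (similarity close to 1), while the variant with the fidelity term keeps them measurably distinct. The value `lam=0.5` is an arbitrary choice for the demonstration, not a setting reported in the source.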

* 24 pages

