Title: Does Self-Attention Need Separate Weights in Transformers?

Nov 30, 2024

Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu

Abstract: The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent directionality. This work introduces a shared weight self-attention-based BERT model that only learns one weight matrix for (Key, Value, and Query) representations instead of three individual matrices for each of them. Our shared weight attention reduces the training parameter size by more than half and training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy on small tasks of GLUE over the BERT baseline and in particular a generalization power on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. In the GLUE dataset, the shared weight self-attention-based BERT model demonstrates accuracy improvements of 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively. The model and source code are available at Anonymous.
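
The idea of learning one weight matrix for Query, Key, and Value can be illustrated with a minimal single-head NumPy sketch. The abstract does not say how the three roles are told apart once they share a projection, so the per-role scaling vectors `d_q`, `d_k`, `d_v` below are an assumption made for illustration, as are all function and variable names. Biases, multiple heads, and the output projection are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_weight_attention(X, W_s, d_q, d_k, d_v):
    """Single-head attention with one shared projection matrix.

    X:   (seq_len, d_model) input token representations
    W_s: (d_model, d_model) the only full weight matrix
    d_q, d_k, d_v: (d_model,) per-role scaling vectors
                   (an assumption of this sketch, not confirmed by the abstract)
    """
    S = X @ W_s                          # one shared projection
    Q, K, V = S * d_q, S * d_k, S * d_v  # cheap role-specific rescaling
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V


def attention_param_counts(d_model):
    """Projection parameters: three separate matrices vs. the shared variant."""
    separate = 3 * d_model * d_model
    shared = d_model * d_model + 3 * d_model
    return separate, shared


# Toy forward pass on random data.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_s = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
d_q, d_k, d_v = (rng.normal(size=d_model) for _ in range(3))
out = shared_weight_attention(X, W_s, d_q, d_k, d_v)

# Parameter comparison at BERT-base width (768), projections only.
separate, shared = attention_param_counts(768)
reduction = 1 - shared / separate
print(out.shape, separate, shared, f"{reduction:.2%}")
```

Under these assumptions the shared variant keeps roughly one third of the Query/Key/Value projection parameters, which is in line with the reduction of about two thirds that the abstract reports for the attention block.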

* Preprint paper
