Title: Length Generalization of Causal Transformers without Position Encoding

Apr 18, 2024

Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang

Figures 1 to 4 for Length Generalization of Causal Transformers without Position Encoding

Abstract: Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning method that searches for the best temperature hyper-parameter of each attention head, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE can achieve performance competitive with state-of-the-art length generalization algorithms. The source code is publicly accessible.
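The abstract does not spell out how the temperature enters the attention computation, so the following is only a minimal sketch under stated assumptions: the temperature is modeled as a per-head multiplier `scale` on the pre-softmax logits (an inverse temperature), and `causal_attention_weights` and `mean_entropy` are hypothetical helper names, not taken from the paper's code. It illustrates one head of causal attention with no position encoding, and how a larger `scale` makes the attention distribution less diffuse.

```python
import math


def causal_attention_weights(queries, keys, scale=1.0):
    """Causal softmax attention weights for one head, with no position encoding.

    queries, keys: lists of equal-length float vectors, one per token.
    scale: hypothetical per-head multiplier on the logits (an inverse
           temperature, assumed here); values above 1 sharpen the distribution.
    """
    head_dim = len(queries[0])
    rows = []
    for i, q in enumerate(queries):
        # Token i may only attend to tokens 0..i (causal mask).
        logits = [
            scale * sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        # Future positions are masked out and receive weight 0.
        rows.append([e / total for e in exps] + [0.0] * (len(queries) - i - 1))
    return rows


def mean_entropy(weights):
    """Average entropy of the attention rows; higher means more diffuse."""
    return sum(
        -sum(p * math.log(p) for p in row if p > 0.0) for row in weights
    ) / len(weights)
```

Under these assumptions, comparing `mean_entropy(causal_attention_weights(q, k, 1.0))` with the same call at `scale=2.0` shows the entropy dropping as the scale grows. Searching for one such scalar per head, as the abstract describes, would add very few tunable parameters compared with fine-tuning the full model.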
