Title: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Aug 27, 2021

Ofir Press, Noah A. Smith, Mike Lewis

Abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
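The central idea in the abstract, biasing each query-key score by a term proportional to the distance between the two positions, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`alibi_bias`, `softmax`, `biased_attention_weights`) are hypothetical, the `slope` value is an assumed hyperparameter chosen for illustration, and causal (left-to-right) masking is assumed because the paper targets language modeling.

```python
import math


def alibi_bias(seq_len, slope):
    """Build a seq_len x seq_len matrix of causal linear biases.

    Entry [i][j] is -slope * (i - j) for keys at or before the query (j <= i),
    so the penalty grows linearly with distance. Future keys (j > i) get -inf,
    which removes them after the softmax. `slope` is an assumed hyperparameter.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, slope):
    """Add the linear bias to raw query-key scores, then normalize each row.

    `scores` is a square list of lists of raw dot products. No positional
    embeddings are involved; position enters only through the added bias.
    """
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [
        softmax([s + b for s, b in zip(scores[i], bias[i])])
        for i in range(n)
    ]


if __name__ == "__main__":
    # With identical raw scores, the bias alone decides the weights,
    # so nearer keys receive more attention than distant ones.
    uniform_scores = [[0.0] * 4 for _ in range(4)]
    for row in biased_attention_weights(uniform_scores, slope=0.5):
        print([round(w, 3) for w in row])
```

Because the bias depends only on the relative distance `i - j`, the same function applies unchanged to any `seq_len`, which is one way to see why a bias of this form does not tie the model to the training length.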

View paper on OpenReview
