Title: SEA: Sparse Linear Attention with Estimated Attention Mask

Oct 03, 2023

Heejun Lee, Jina Kim, Jeffrey Willette, Sung Ju Hwang

Abstract: The transformer architecture has made breakthroughs in recent years on tasks that require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, transformers struggle with long sequences due to the quadratic complexity of the attention operation, and previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix. Yet these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix, and they often require complete retraining from scratch. Furthermore, previous sparse and linear approaches may lose interpretability if they do not produce full quadratic attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then creates a sparse approximation to the full attention matrix with a top-k selection to perform a sparse attention operation. For language modeling tasks (Wikitext2), previous linear and sparse attention methods show roughly two-fold worse perplexity than the quadratic OPT-125M baseline, while SEA achieves better perplexity than OPT-125M using roughly half as much memory. Moreover, SEA maintains an interpretable attention matrix and can utilize knowledge distillation to lower the complexity of existing pretrained transformers. We believe that our work will have a large practical impact, as it opens the possibility of running large transformers on resource-limited devices with less memory.
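
The estimate-then-select pipeline described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the feature map (`elu(x) + 1`), the function names, and the toy sizes are all assumptions chosen for illustration, and causal masking is omitted. For readability the sketch also materializes the full estimated score matrix, so it is quadratic in sequence length and does not show the linear-complexity estimation or the memory savings the abstract reports.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a simple positive feature map.
    # Assumption for illustration; not necessarily the kernel used in the paper.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def estimate_scores(q, k):
    # Kernel-based estimate of the attention scores.
    # NOTE: builds a full (n, n) matrix for clarity, so this step is quadratic here.
    return feature_map(q) @ feature_map(k).T

def topk_mask(scores, k_keep):
    # Boolean mask that keeps the k_keep largest estimated scores in each row.
    idx = np.argpartition(scores, -k_keep, axis=-1)[:, -k_keep:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def sparse_attention(q, k, v, mask):
    # Softmax attention restricted to the positions selected by the mask.
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)          # masked positions become exactly 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
n, d, k_keep = 8, 4, 3
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

mask = topk_mask(estimate_scores(q, k), k_keep)
out, weights = sparse_attention(q, k, v, mask)
```

Each row of `weights` has at most `k_keep` nonzero entries, so the result is still a readable attention matrix, which is the interpretability property the abstract emphasizes.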

* 9 main pages
