Title: Predicting Attention Sparsity in Transformers

Sep 24, 2021

Marcos Treviso, António Góis, Patrick Fernandes, Erick Fonseca, André F. T. Martins

Abstract: A bottleneck in transformer architectures is their quadratic complexity with respect to the input sequence length, which has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for detailed comparison between different models, and may guide future benchmarks for sparse models.
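
To make the sparsity/recall tradeoff mentioned in the abstract concrete, here is a minimal sketch of how the two quantities could be computed for one attention head. The definitions are assumptions made for illustration, not quoted from the paper: sparsity is taken as the fraction of query-key pairs that the predicted graph leaves out, and recall as the fraction of the gold nonzero entmax entries that the predicted graph keeps. The function name `sparsity_and_recall` and the boolean-mask representation are hypothetical.

```python
from typing import List, Tuple

# mask[i][j] is True if query i is allowed to attend to key j.
Mask = List[List[bool]]


def sparsity_and_recall(predicted: Mask, gold: Mask) -> Tuple[float, float]:
    """Return (sparsity, recall) of a predicted attention graph.

    Assumed definitions (illustrative, not taken from the paper):
      sparsity = fraction of query-key pairs the prediction leaves out
      recall   = fraction of gold nonzero pairs the prediction keeps
    """
    total = kept = gold_edges = recovered = 0
    for pred_row, gold_row in zip(predicted, gold):
        for p, g in zip(pred_row, gold_row):
            total += 1
            kept += int(p)
            gold_edges += int(g)
            recovered += int(p and g)
    if total == 0:
        raise ValueError("masks must contain at least one entry")
    sparsity = 1.0 - kept / total
    # With no gold edges there is nothing to miss, so recall is trivially 1.
    recall = recovered / gold_edges if gold_edges else 1.0
    return sparsity, recall


# Toy 3x3 example: gold is the (hypothetical) nonzero pattern of entmax
# attention, pred is what a predictor guessed before computing attention.
gold = [
    [True, False, False],
    [False, True, True],
    [False, False, True],
]
pred = [
    [True, True, False],
    [False, True, False],
    [False, False, True],
]
sparsity, recall = sparsity_and_recall(pred, gold)
print(f"sparsity={sparsity:.3f} recall={recall:.3f}")
```

Under these assumed definitions, a predictor that keeps every pair reaches recall 1 with sparsity 0, while one that keeps nothing reaches sparsity 1 with recall 0 (when gold edges exist); the interesting operating points lie between those extremes.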
