Title: Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

Mar 07, 2024

Ali Hassani, Wen-Mei Hwu, Humphrey Shi

Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in functionality, performance, or both. In this work, we first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision latency compared to existing naive kernels for 1-D and 2-D neighborhood attention, respectively. We find certain inherent inefficiencies in all unfused neighborhood attention kernels that bound their performance and lower-precision scalability. We also developed fused neighborhood attention, an adaptation of fused dot-product attention kernels that allows fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision latency. We observe that our fused kernels successfully circumvent some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 496% and 113% in 1-D and 2-D problems, respectively, our fused kernels improve naive kernels by an average of 1607% and 581% in 1-D and 2-D problems, respectively.

* Project page: https://github.com/SHI-Labs/NATTEN
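
To make the abstract's description concrete, here is a minimal pure-Python sketch of 1-D neighborhood attention for a single head, with the window size and dilation factor as parameters. It is a naive reference loop, not the GEMM-based or fused kernels the paper describes, and it does not use the NATTEN API. The function names are hypothetical, and the boundary handling (shifting the window inward at the edges so every token still attends to exactly `window` keys) is an assumed convention.

```python
import math


def neighborhood_indices(i, n, window, dilation=1):
    """Key indices that query i attends to among n tokens.

    Assumed convention: tokens with the same index modulo `dilation`
    form a group, and the window is shifted inward at the edges so it
    always contains exactly `window` keys.
    """
    group = list(range(i % dilation, n, dilation))
    if window > len(group):
        raise ValueError("window * dilation must not exceed sequence length")
    pos = i // dilation  # position of token i inside its dilation group
    start = min(max(pos - window // 2, 0), len(group) - window)
    return group[start:start + window]


def neighborhood_attention_1d(q, k, v, window, dilation=1):
    """Naive single-head 1-D neighborhood attention over lists of vectors."""
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        idx = neighborhood_indices(i, n, window, dilation)
        # Scaled dot products between query i and its neighboring keys only.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in idx]
        # Numerically stable softmax over the neighborhood.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum of the neighboring value vectors.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, idx)) / total
            for c in range(len(v[0]))
        ])
    return out
```

Each query touches only `window` keys, so the work grows linearly with sequence length for a fixed window. With `window=1` each token attends only to itself and the output equals `v`, while a window spanning the whole sequence (with `dilation=1`) recovers ordinary self attention, which matches the spectrum the abstract describes.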
