Title: ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys

Mar 01, 2024

Yue Niu, Saurav Prakash, Salman Avestimehr

Abstract: We propose a new attention mechanism with linear complexity, ATP, that fixates Attention on Top Principal keys, rather than on each individual token. Specifically, ATP is driven by an important observation that input sequences are typically low-rank, i.e., input sequences can be represented by a few principal bases. Therefore, instead of directly iterating over all the input tokens, ATP transforms inputs into an orthogonal space and computes attention only on the top principal bases (keys). Owing to the observed low-rank structure in input sequences, ATP is able to capture semantic relationships in input sequences with a few principal keys. Furthermore, the attention complexity is reduced from quadratic to linear without incurring a noticeable performance drop. ATP further reduces complexity for other linear layers with low-rank inputs, leading to more speedup compared to prior works that solely target the attention module. Our evaluations on various models (e.g., BERT and Llama) demonstrate that ATP achieves comparable accuracy with much lower computation and memory complexity than the standard attention mechanism. In particular, ATP barely loses accuracy with only 1/2 principal keys, and incurs an accuracy drop of only around 2% with 1/4 principal keys.
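
The abstract does not give the exact computation, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes the "orthogonal space" comes from a truncated SVD of the input, and that keys and values are projected from the top `r` principal bases instead of from all `n` tokens. The function name `principal_key_attention`, the weight matrices, and the scaling are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def principal_key_attention(X, Wq, Wk, Wv, r):
    """Attend from n queries to r principal keys (hypothetical sketch).

    X  : (n, d) input sequence, assumed to be approximately low-rank
    Wq, Wk, Wv : (d, d_k), (d, d_k), (d, d_v) projection matrices
    r  : number of principal bases to keep (r <= min(n, d))
    """
    # Thin SVD: X = U @ diag(S) @ Vt, singular values sorted descending.
    _, S, Vt = np.linalg.svd(X, full_matrices=False)

    # Assumption: the top-r right singular vectors, scaled by their
    # singular values, play the role of the "principal bases".
    P = S[:r, None] * Vt[:r]              # (r, d)

    Q = X @ Wq                            # (n, d_k) one query per token
    K = P @ Wk                            # (r, d_k) one key per principal basis
    V = P @ Wv                            # (r, d_v) one value per principal basis

    # The score matrix is (n, r) rather than (n, n), so this step grows
    # linearly with sequence length n when r is held fixed.
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights


# Toy usage on a synthetic rank-4 input.
rng = np.random.default_rng(0)
n, d, r = 64, 16, 4
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

out, weights = principal_key_attention(X, Wq, Wk, Wv, r)
print(out.shape, weights.shape)
```

In this sketch the attention step costs O(n·r·d) instead of O(n²·d). The SVD is recomputed on every call purely for clarity; how the principal bases are actually obtained and applied in ATP is not stated in the abstract.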

* 10 pages, 7 figures, 8 tables
