
Title: Attamba: Attending To Multi-Token States

Nov 26, 2024

Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah



Abstract: When predicting the next token in a sequence, vanilla transformers compute attention over all previous tokens, so compute scales quadratically with sequence length. State-space models (SSMs) compress the entire sequence of tokens into a fixed-dimensional representation to improve efficiency, while other architectures achieve sub-quadratic complexity via low-rank projections or sparse attention patterns over the sequence. In this paper, we introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens and applies attention to these compressed key-value representations. We find that replacing the key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking, yielding 24% better perplexity than a transformer with a similar KV-cache and attention footprint, and a roughly 4x smaller KV-cache and 4x fewer attention FLOPs for a 5% perplexity trade-off. Attamba can perform attention on chunked sequences of variable length, enabling a smooth transition between quadratic and linear scaling and offering adaptable efficiency gains.
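The chunk-then-attend idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the fixed-decay linear recurrence below is only a stand-in for a learned SSM, keeping the state at the end of each chunk is an assumed compression rule, and the names `ssm_scan`, `compress`, `attend`, and `decay` are hypothetical. It only shows how attending to one compressed state per chunk shrinks the key-value cache for a single next-token query.

```python
import math

def ssm_scan(xs, decay=0.9):
    """Toy diagonal recurrence h_t = decay*h_{t-1} + (1-decay)*x_t.
    Assumption: a stand-in for a learned SSM, not the paper's model."""
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states

def compress(xs, chunk_size, decay=0.9):
    """Keep one state per chunk: the state at each chunk's last token
    (assumed rule), plus the final state if the last chunk is partial."""
    states = ssm_scan(xs, decay)
    n = len(states)
    ends = list(range(chunk_size - 1, n, chunk_size))
    if not ends or ends[-1] != n - 1:
        ends.append(n - 1)
    return [states[i] for i in ends]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

# Hypothetical toy inputs: 16 tokens, 4-dimensional keys and values.
seq_len, dim, chunk_size = 16, 4, 4
keys = [[math.sin(t + j) for j in range(dim)] for t in range(seq_len)]
values = [[math.cos(t - j) for j in range(dim)] for t in range(seq_len)]
query = [0.5] * dim

compressed_keys = compress(keys, chunk_size)
compressed_values = compress(values, chunk_size)
output = attend(query, compressed_keys, compressed_values)
print(len(keys), "cached entries ->", len(compressed_keys))
```

With a chunk size of 4, the query attends to 4 compressed states instead of 16 per-token entries; a chunk size of 1 keeps one state per token, and a chunk size equal to the sequence length leaves a single state, which is the quadratic-to-linear range the abstract describes.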
