Title: Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens

May 07, 2023

Zhanpeng Zeng, Cole Hawkins, Mingyi Hong, Aston Zhang, Nikolaos Pappas, Vikas Singh, Shuai Zheng

Abstract: Transformer models are foundational to natural language processing (NLP) and computer vision. Despite various recent works devoted to reducing the quadratic cost of such models (as a function of the sequence length $n$), dealing with ultra-long sequences efficiently (e.g., with more than 16K tokens) remains challenging. Applications such as answering questions based on an entire book or summarizing a scientific article are inefficient or infeasible. In this paper, we propose to significantly reduce the dependency of a Transformer model's complexity on $n$ by compressing the input into a representation whose size $r$ is independent of $n$ at each layer. Specifically, by exploiting the fact that in many tasks only a small subset of special tokens (which we call VIP-tokens) is most relevant to the final prediction, we propose a VIP-token centric compression (Vcc) scheme that selectively compresses the input sequence based on the impact of its tokens on approximating the representation of these VIP-tokens. Compared with competitive baselines, the proposed algorithm is not only efficient (achieving more than $3\times$ efficiency improvement over baselines at 4K and 16K lengths) but also achieves competitive or better performance on a large number of tasks. Further, we show that our algorithm can be scaled to 128K tokens (or more) while consistently offering accuracy improvement.

* 10 pages main text, 11 pages appendix, preprint
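
The abstract's central idea, keeping a few VIP-tokens exact while squeezing everything else into a representation whose size does not grow with $n$, can be illustrated with a toy sketch. This is not the paper's algorithm, which the abstract does not spell out. The function `compress_around_vip`, the `keep` parameter, the dot-product importance score, and the mean-pooling step are all assumptions made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_around_vip(tokens, vip_idx, keep):
    """Toy VIP-centric compression (hypothetical, not the paper's scheme).

    Keeps the VIP tokens unchanged, keeps the `keep` non-VIP tokens that
    receive the most attention from the VIP tokens, and mean-pools all
    remaining tokens into one vector.
    """
    vip_set = set(vip_idx)
    others = [i for i in range(len(tokens)) if i not in vip_set]
    out = [tokens[i] for i in vip_idx]
    if not others:
        return out

    # Importance of a non-VIP token: total attention weight it receives
    # from the VIP tokens (softmax over dot-product scores).
    importance = {i: 0.0 for i in others}
    for v in vip_idx:
        weights = softmax([dot(tokens[v], tokens[i]) for i in others])
        for i, w in zip(others, weights):
            importance[i] += w

    ranked = sorted(others, key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])   # restore original order for kept tokens
    pooled_idx = ranked[keep:]     # low-importance tokens get merged

    out.extend(tokens[i] for i in kept)
    if pooled_idx:
        dim = len(tokens[0])
        pooled = [
            sum(tokens[i][d] for i in pooled_idx) / len(pooled_idx)
            for d in range(dim)
        ]
        out.append(pooled)
    return out


# Synthetic 2-dimensional "token embeddings"; the values carry no meaning.
tokens = [[float(i % 3), float(i % 5)] for i in range(32)]
long_out = compress_around_vip(tokens, vip_idx=[0, 1], keep=4)
short_out = compress_around_vip(tokens[:10], vip_idx=[0, 1], keep=4)
print(len(long_out), len(short_out))
```

In this sketch the compressed length is the number of VIP-tokens plus `keep` plus one pooled vector, so a 32-token input and a 10-token input both shrink to the same size. That mirrors, in a very simplified way, the abstract's claim that the per-layer representation size $r$ is independent of $n$.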
