Title: Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression

Feb 20, 2025

Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang

Abstract: Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long-sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving comparable performance to full-attention extrapolation methods across various tasks, with superior results in certain tasks.
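The idea in the abstract (score tokens in a compressed query-key space, then attend only to the selected tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the projection matrices `proj_q` and `proj_k`, and the single-query, single-head setup are all assumptions made for illustration, and the abstract does not say how the compression is obtained.

```python
import numpy as np


def select_top_k_tokens(q, K, proj_q, proj_k, k):
    """Pick the k key positions with the highest score in a compressed space.

    q: (d,) query, K: (n, d) keys.
    proj_q, proj_k: (d, r) hypothetical compression matrices with r << d.
    """
    n = K.shape[0]
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, n)
    q_low = q @ proj_q        # (r,) compressed query
    K_low = K @ proj_k        # (n, r) compressed keys
    scores = K_low @ q_low    # (n,) cheap relevance scores, O(n * r)
    # argpartition returns the top-k positions without a full sort.
    idx = np.argpartition(scores, -k)[-k:]
    return np.sort(idx)       # keep the original token order


def selective_attention(q, K, V, idx):
    """Standard softmax attention restricted to the selected positions."""
    d = q.shape[-1]
    logits = (K[idx] @ q) / np.sqrt(d)
    logits = logits - logits.max()   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ V[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, r, k = 4096, 64, 8, 128
    q = rng.standard_normal(d)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    # Random projections stand in for whatever compression ESA actually uses.
    proj_q = rng.standard_normal((d, r))
    proj_k = rng.standard_normal((d, r))
    idx = select_top_k_tokens(q, K, proj_q, proj_k, k)
    out = selective_attention(q, K, V, idx)
    print(idx.shape, out.shape)
```

In this sketch the selection step costs on the order of n·r multiply-adds per query instead of n·d, and full-dimensional attention is computed over only k tokens. Selection happens per token rather than per chunk, which is the granularity the abstract emphasizes.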

* 14 pages, 2 figures
