Title: EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices

Mar 28, 2025

Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, Xiaoxin Chen

Abstract: Transformer-based large language models (LLMs) struggle to process long sequences on edge devices because of the quadratic complexity of attention and the growing memory demands of the Key-Value (KV) cache. Existing KV cache optimizations suffer from irreversible token eviction in long-output tasks, while alternative sequence modeling architectures are costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach remains fully compatible with standard Transformer architectures, requires fine-tuning only a small fraction of the parameters, and allows the memory-gating module to be selectively activated to route between long- and short-context tasks. Experimental results show that EdgeInfinite achieves performance comparable to a baseline Transformer-based LLM on long-context benchmarks while optimizing memory consumption and time to first token.
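To make the "growing memory demands" of the KV cache concrete, here is a minimal sketch of the standard size calculation: the cache stores one key and one value tensor per layer, so its size grows linearly with sequence length. The model configuration (`CONFIG`) is hypothetical and not taken from the paper. The `fixed_memory_bytes` helper is likewise an assumption: it models a compressed memory as one `head_dim x head_dim` matrix per head per layer, purely to illustrate a footprint that does not depend on sequence length. It is not EdgeInfinite's actual memory layout.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def fixed_memory_bytes(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for a hypothetical fixed-size memory (illustrative assumption).

    Assumes one head_dim x head_dim matrix per head per layer; the size
    does not depend on how many tokens have been processed.
    """
    return n_layers * n_kv_heads * head_dim * head_dim * bytes_per_elem


# Hypothetical small-model configuration (not from the paper), fp16 elements.
CONFIG = dict(n_layers=24, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    fixed_mib = fixed_memory_bytes(**CONFIG) / 2**20
    for seq_len in (1024, 8192, 65536):
        cache_mib = kv_cache_bytes(seq_len, **CONFIG) / 2**20
        print(f"{seq_len:>6} tokens: KV cache {cache_mib:8.1f} MiB, "
              f"fixed memory {fixed_mib:.1f} MiB")
```

Under these assumed numbers the cache size doubles every time the context doubles, while the fixed-size memory stays constant, which is the trade-off the abstract points to for edge devices.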

* 8 pages, 3 figures
