Title: vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

May 07, 2024

Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

Abstract: Efficient use of GPU memory is essential for high-throughput LLM inference. Prior systems reserved memory for the KV-cache ahead of time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. However, to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous virtual memory to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and the serving framework to implement a memory manager. Thus, the PagedAttention model leads to software complexity, portability issues, redundancy, and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer, respectively.
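The core idea in the abstract (a contiguous virtual range per request, with physical memory attached only as tokens arrive) can be illustrated with a toy, CPU-only simulation. This is a minimal sketch and not the paper's implementation: the names `PhysicalPool`, `ContiguousKVCache`, and `PAGE_TOKENS` are hypothetical, the page size is an arbitrary assumption, and no GPU or driver API is involved.

```python
from dataclasses import dataclass, field

# Assumption for illustration: KV entries for 16 tokens fit in one physical page.
PAGE_TOKENS = 16


class PhysicalPool:
    """Fixed pool of physical pages shared by all requests (hypothetical helper)."""

    def __init__(self, num_pages: int) -> None:
        self.free_pages = list(range(num_pages))

    def alloc(self) -> int:
        if not self.free_pages:
            raise MemoryError("out of physical pages")
        return self.free_pages.pop()

    def release(self, page: int) -> None:
        self.free_pages.append(page)


@dataclass
class ContiguousKVCache:
    """One request's KV-cache: a contiguous virtual range sized for the maximum
    context, backed by physical pages only when tokens actually need them."""

    pool: PhysicalPool
    max_tokens: int
    num_tokens: int = 0
    # mapped[i] is the physical page backing virtual page i.
    mapped: list = field(default_factory=list)

    def append_token(self) -> int:
        """Return the virtual slot for the next token, mapping a page on demand."""
        if self.num_tokens >= self.max_tokens:
            raise ValueError("virtual range exhausted")
        # A new physical page is needed only when all mapped pages are full.
        if self.num_tokens == len(self.mapped) * PAGE_TOKENS:
            self.mapped.append(self.pool.alloc())
        slot = self.num_tokens  # the virtual index is simply the token position
        self.num_tokens += 1
        return slot

    def release_all(self) -> None:
        """Return every physical page to the pool when the request finishes."""
        for page in self.mapped:
            self.pool.release(page)
        self.mapped.clear()
        self.num_tokens = 0


pool = PhysicalPool(num_pages=8)
cache = ContiguousKVCache(pool, max_tokens=1024)
slots = [cache.append_token() for _ in range(20)]
print(slots[:3], len(cache.mapped), len(pool.free_pages))
```

In this sketch the caller indexes tokens by position alone and never consults a block table, which mirrors why an attention kernel written for a contiguous layout would not need to change. Only two of the eight physical pages are consumed for 20 tokens, even though the virtual range is sized for 1024.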

* 15 pages, 12 figures, 8 tables
