Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Jan 14, 2025

Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli

Figure 1 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Figure 2 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Figure 3 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Figure 4 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Share this with someone who'll enjoy it:

Abstract:Large language models (LLMs) are widely used across various applications, but their substantial computational requirements pose significant challenges, particularly in terms of HBM bandwidth bottlenecks and inter-device communication overhead. In this paper, we present PRESERVE, a novel prefetching framework designed to optimize LLM inference by overlapping memory reads for model weights and KV-cache with collective communication operations. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE has the potential to mitigate the memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of the LLM inference systems.

View paper on

Share this with someone who'll enjoy it:

Title:PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Paper and Code