Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Learned Best-Effort LLM Serving

Jan 15, 2024

Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer

Figure 1 for Learned Best-Effort LLM Serving

Figure 2 for Learned Best-Effort LLM Serving

Figure 3 for Learned Best-Effort LLM Serving

Figure 4 for Learned Best-Effort LLM Serving

Share this with someone who'll enjoy it:

Abstract:Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Our best-effort system can maintain availability with over 10x higher client request rates, serves above 96% of peak performance 4.1x more often, and serves above 98% of peak performance 2.3x more often than static serving on unpredictable workloads. Our learned router is robust to shifts in both the arrival and task distribution. Compared to static serving, learned best-effort serving allows for cost-efficient serving through increased hardware utility. Additionally, we argue that learned best-effort LLM serving is applicable in wide variety of settings and provides application developers great flexibility to meet their specific needs.

View paper on

Share this with someone who'll enjoy it:

Title:Learned Best-Effort LLM Serving

Paper and Code