Title: FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving

Nov 27, 2024

Ao Shen, Zhiyao Li, Mingyu Gao

Abstract: Serving many users and requests concurrently requires good fairness in a Large Language Model (LLM) serving system. Fairness ensures that, at the same cost, the system can meet the Service Level Objectives (SLOs) of more users, such as time to first token (TTFT) and time between tokens (TBT), rather than letting a few users experience performance far exceeding the SLOs. To achieve better fairness, a preemption-based scheduling policy dynamically adjusts the priority of each request to maintain balance at runtime. However, existing systems tend to over-prioritize throughput and overlook the overhead of preemption-induced context switching, which is crucial for maintaining fairness through priority adjustments. In this work, we identify three main challenges that cause this overhead: 1) inadequate I/O utilization, 2) GPU idleness, and 3) unnecessary I/O transmission during multi-turn conversations. Our key insight is that the block-based KV cache memory policy in existing systems, while achieving near-zero memory waste, leads to discontinuity and insufficient granularity in the KV cache memory. In response, we introduce FastSwitch, a fairness-aware serving system that not only aligns with the existing KV cache memory allocation policy but also mitigates context switching overhead. Our evaluation shows that FastSwitch outperforms the state-of-the-art LLM serving system vLLM with speedups of 1.4-11.2x across different tail TTFT and TBT.
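To make the key insight concrete, here is a minimal sketch of why discontinuous block allocation can inflate the number of I/O operations when a preempted request's KV cache is swapped out. It is an illustration only, not FastSwitch's actual mechanism: the function `contiguous_runs` and the block ID lists are hypothetical, and the sketch assumes that each run of physically adjacent blocks could be moved with a single copy, whereas a per-block scheme issues one copy per block.

```python
from typing import List, Tuple


def contiguous_runs(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Group physical block IDs into (start, length) runs of adjacent blocks.

    Hypothetical helper: each run stands for one copy operation that a
    swap-out could issue if adjacent blocks are transferred together.
    """
    runs: List[Tuple[int, int]] = []
    for block in sorted(block_ids):
        if runs and runs[-1][0] + runs[-1][1] == block:
            # Block directly follows the previous run: extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # Gap in physical addresses: start a new run.
            runs.append((block, 1))
    return runs


# Hypothetical block tables for two requests holding six KV cache blocks each.
fragmented = [3, 17, 4, 42, 18, 5]     # blocks scattered by a block-based allocator
contiguous = [8, 9, 10, 11, 12, 13]    # blocks that happen to be adjacent

for name, table in (("fragmented", fragmented), ("contiguous", contiguous)):
    runs = contiguous_runs(table)
    print(f"{name}: {len(table)} per-block copies vs {len(runs)} merged copies")
```

Under these assumptions, both requests need six copies when blocks are moved one at a time, while merging adjacent blocks needs three copies for the scattered layout and one for the adjacent layout. The more fragmented the layout, the less merging helps.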
