Title: Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Apr 22, 2024

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

Abstract: Large language models (LLMs) are increasingly integrated into many online services. However, a major challenge in deploying LLMs is their high cost, due primarily to the use of expensive GPU instances. To address this problem, we find that the significant heterogeneity of GPU types presents an opportunity to increase GPU cost efficiency and reduce deployment costs. The broad and growing market of GPUs creates a diverse option space with varying costs and hardware specifications. Within this space, we show that there is not a linear relationship between GPU cost and performance, and identify three key LLM service characteristics that significantly affect which GPU type is the most cost effective: model request size, request rate, and latency service-level objective (SLO). We then present Mélange, a framework for navigating the diversity of GPUs and LLM service specifications to derive the most cost-efficient set of GPUs for a given LLM service. We frame the task of GPU selection as a cost-aware bin-packing problem, where GPUs are bins with a capacity and cost, and items are request slices defined by a request size and rate. Upon solution, Mélange derives the minimal-cost GPU allocation that adheres to a configurable latency SLO. Our evaluations across both real-world and synthetic datasets demonstrate that Mélange can reduce deployment costs by up to 77% as compared to utilizing only a single GPU type, highlighting the importance of making heterogeneity-aware GPU provisioning decisions for LLM serving. Our source code is publicly available at https://github.com/tyler-griggs/melange-release.
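The cost-aware bin-packing framing in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `cheapest_allocation`, the GPU names, prices, and throughput figures are all hypothetical, and the exhaustive search is a stand-in for whatever solver Mélange actually uses. It assumes each request slice is served entirely by one GPU type, that load within a type is balanced evenly across instances, and that the SLO is already folded into the per-GPU maximum request rates.

```python
import itertools
import math

def cheapest_allocation(gpu_cost, max_rate, slices):
    """Brute-force toy version of cost-aware bin packing (illustrative only).

    gpu_cost: {gpu_type: hourly cost}
    max_rate: {gpu_type: {size_bucket: max requests/s one GPU sustains within the SLO}}
    slices:   list of (size_bucket, requests/s) items to pack
    Returns (total_cost, {gpu_type: instance_count}).
    """
    gpus = list(gpu_cost)
    best_cost, best_counts = math.inf, None
    # Try every way of assigning each slice to a GPU type.
    for assignment in itertools.product(gpus, repeat=len(slices)):
        load = {g: 0.0 for g in gpus}
        feasible = True
        for (bucket, rate), g in zip(slices, assignment):
            cap = max_rate[g].get(bucket, 0.0)
            if cap <= 0:  # this GPU type cannot serve the bucket within the SLO
                feasible = False
                break
            load[g] += rate / cap  # fraction of one GPU consumed by this slice
        if not feasible:
            continue
        # Each instance is a bin of capacity 1.0; round the load up to whole GPUs.
        counts = {g: math.ceil(load[g] - 1e-9) for g in gpus}
        cost = sum(counts[g] * gpu_cost[g] for g in gpus)
        if cost < best_cost:
            best_cost, best_counts = cost, counts
    return best_cost, best_counts

# Hypothetical numbers: a cheap GPU that handles short requests well,
# and an expensive GPU that is far better on long requests.
gpu_cost = {"small": 1.0, "large": 4.0}
max_rate = {
    "small": {"short": 4.0, "long": 0.5},
    "large": {"short": 10.0, "long": 4.0},
}
slices = [("short", 3.0), ("long", 3.5)]

mixed_cost, mixed_counts = cheapest_allocation(gpu_cost, max_rate, slices)
large_only_cost, _ = cheapest_allocation({"large": 4.0}, max_rate, slices)
small_only_cost, _ = cheapest_allocation({"small": 1.0}, max_rate, slices)
print(mixed_cost, mixed_counts, large_only_cost, small_only_cost)
```

With these made-up numbers, the mixed allocation (one of each GPU type) is cheaper than either single-type deployment, which is the kind of effect the abstract attributes to heterogeneity-aware provisioning. The search is exponential in the number of slices, so it only suits small illustrative inputs.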
