Abstract: Efficiently deploying large language models (LLMs) in real-world scenarios remains a critical challenge, primarily due to hardware heterogeneity, inference framework limitations, and workload complexity. These challenges often lead to inefficiencies in memory utilization, latency, and throughput, hindering the effective deployment of LLMs, especially for non-experts. Through extensive experiments, we identify key performance bottlenecks, including sudden drops in memory utilization, latency fluctuations with varying batch sizes, and inefficiencies in multi-GPU configurations. These insights reveal a vast optimization space shaped by the intricate interplay of hardware, frameworks, and workload parameters, underscoring the need for a systematic approach to optimizing LLM inference and motivating the design of our framework, GUIDE. GUIDE leverages dynamic modeling and simulation-based optimization to address these issues, achieving prediction errors between 25% and 55% for key metrics such as batch latency, time to first token (TTFT), and decode throughput. By effectively bridging the gap between theoretical performance and practical deployment, our framework empowers practitioners, particularly non-specialists, to make data-driven decisions and cost-effectively unlock the full potential of LLMs in heterogeneous environments.