Małgorzata Łazuka

LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

Oct 03, 2024

Małgorzata Łazuka, Andreea Anghel, Thomas Parnell

Abstract: As Large Language Models (LLMs) grow rapidly in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, but determining which hardware will meet those performance requirements remains challenging. In this work we present LLM-Pilot, a first-of-its-kind system for characterizing and predicting the performance of LLM inference services. LLM-Pilot benchmarks LLM inference services under a realistic workload across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model that can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot meets performance requirements 33% more frequently while reducing costs by 60% on average.

* Accepted to the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)
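
The abstract does not say how LLM-Pilot turns its performance predictions into a hardware recommendation. The following is only a minimal sketch, assuming the predictive model outputs the maximum number of concurrent users one instance of a given GPU type can serve within the latency target, and that total cost scales with the number of instances. The GPU labels, prices, `GpuOption`, and `predict_max_users` are all hypothetical and are not LLM-Pilot's actual interface.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GpuOption:
    name: str           # hypothetical GPU profile label
    hourly_cost: float  # hypothetical price per instance-hour


def recommend_gpu(
    options: List[GpuOption],
    predict_max_users: Callable[[str], Optional[float]],
    required_users: int,
) -> Optional[Tuple[str, int, float]]:
    """Return (gpu_name, instances, total_hourly_cost) for the cheapest option
    predicted to serve `required_users`, or None if no option qualifies."""
    best: Optional[Tuple[str, int, float]] = None
    for opt in options:
        # Hypothetical predictor: users one instance can serve within the latency target.
        capacity = predict_max_users(opt.name)
        if not capacity or capacity <= 0:
            continue  # predicted unable to meet the requirement on this GPU at all
        instances = math.ceil(required_users / capacity)
        total = instances * opt.hourly_cost
        if best is None or total < best[2]:
            best = (opt.name, instances, total)
    return best


if __name__ == "__main__":
    opts = [GpuOption("GPU-A", 1.0), GpuOption("GPU-B", 3.0), GpuOption("GPU-C", 2.0)]
    predicted = {"GPU-A": 10.0, "GPU-B": 40.0, "GPU-C": 0.0}  # made-up predictions
    print(recommend_gpu(opts, predicted.get, 100))  # ('GPU-B', 3, 9.0)
```

In this toy example the pricier GPU wins because fewer instances are needed; choosing by cost per served user rather than by price per instance is the point of the sketch.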

Search-based Methods for Multi-Cloud Configuration

Apr 20, 2022

Małgorzata Łazuka, Thomas Parnell, Andreea Anghel, Haralampos Pozidis

Abstract: Multi-cloud computing has become increasingly popular with enterprises looking to avoid vendor lock-in. While most cloud providers offer similar functionality, they may differ significantly in performance and/or cost. A customer looking to benefit from such differences will naturally want to solve the multi-cloud configuration problem: given a workload, which cloud provider should be chosen, and how should its nodes be configured, in order to minimize runtime or cost? In this work, we consider solutions to this optimization problem. We develop and evaluate possible adaptations of state-of-the-art cloud configuration solutions to the multi-cloud domain. Furthermore, we identify an analogy between multi-cloud configuration and the selection-configuration problems commonly studied in the automated machine learning (AutoML) field. Inspired by this connection, we use popular optimizers from AutoML to solve multi-cloud configuration. Finally, we propose a new algorithm for solving multi-cloud configuration, CloudBandit (CB). It treats the outer problem of cloud provider selection as a best-arm identification problem, in which each arm pull corresponds to running an arbitrary black-box optimizer on the inner problem of node configuration. Our experiments indicate that (a) many state-of-the-art cloud configuration solutions can be adapted to multi-cloud, with the best results obtained by adaptations that exploit the hierarchical structure of the multi-cloud configuration domain; (b) hierarchical methods from AutoML can be used for the multi-cloud configuration task and can outperform state-of-the-art cloud configuration solutions; and (c) CB achieves competitive or lower regret relative to the other tested algorithms, while also identifying configurations with 65% lower median cost and 20% lower median time in production, compared to choosing a random provider and configuration.

* Submitted to IEEE Cloud 2022
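
The abstract gives CloudBandit's two-level structure but not its exact arm-elimination schedule. As a minimal sketch, assume a successive-halving style of best-arm identification over providers, with plain random search standing in for the arbitrary black-box inner optimizer. Provider names, configuration spaces, costs, `RandomSearch`, and `budget_per_pull` are hypothetical; this illustrates the outer/inner split, not the paper's implementation.

```python
import math
import random
from typing import Callable, Dict, Optional, Sequence, Tuple

Config = Tuple[int, ...]  # hypothetical node configuration, e.g. (node_count,)


class RandomSearch:
    """Stand-in black-box optimizer for one provider's inner node-configuration problem."""

    def __init__(self, space: Sequence[Config], objective: Callable[[Config], float], seed: int = 0):
        self.space = list(space)
        self.objective = objective
        self.rng = random.Random(seed)
        self.best: Optional[Tuple[float, Config]] = None

    def run(self, budget: int) -> float:
        """Evaluate `budget` random configurations; return the best objective seen so far."""
        for _ in range(budget):
            cfg = self.rng.choice(self.space)
            value = self.objective(cfg)
            if self.best is None or value < self.best[0]:
                self.best = (value, cfg)
        return self.best[0]


def cloud_bandit(providers: Dict[str, RandomSearch], budget_per_pull: int = 2) -> Tuple[str, Config, float]:
    """Successive halving over providers (arms); each arm pull resumes that provider's inner optimizer."""
    alive = list(providers)
    while len(alive) > 1:
        # One pull per surviving arm, then keep the better half (lower best objective).
        scores = {p: providers[p].run(budget_per_pull) for p in alive}
        alive.sort(key=lambda p: scores[p])
        alive = alive[: math.ceil(len(alive) / 2)]
    winner = alive[0]
    if providers[winner].best is None:  # single provider: still run its inner optimizer once
        providers[winner].run(budget_per_pull)
    value, cfg = providers[winner].best
    return winner, cfg, value


if __name__ == "__main__":
    space = [(n,) for n in range(1, 5)]
    providers = {
        "A": RandomSearch(space, lambda c: 10 + c[0], seed=1),
        "B": RandomSearch(space, lambda c: float(c[0]), seed=2),
        "C": RandomSearch(space, lambda c: 20 + c[0], seed=3),
    }
    print(cloud_bandit(providers)[0])  # B
```

Provider "B" always wins here because every one of its made-up costs (1 to 4) is below every cost of "A" (11 to 14) and "C" (21 to 24). Because each pull resumes the same inner optimizer, the budget spent on a surviving provider accumulates across rounds rather than being restarted.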
