Abstract: Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture-of-Experts (MoE) based LLM fine-tuning to understand its accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. Based on model and GPU-architecture parameters, this model estimates LLM throughput and training cost, helping practitioners in industry and academia budget the cost of fine-tuning a specific model.
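The abstract does not give the exact form of the analytical cost model, so the following is a minimal illustrative sketch of how such an estimate is commonly structured: throughput is derived from the model's FLOPs per token and the GPU's peak FLOPS scaled by an assumed utilization factor, and cost follows from training time and an hourly GPU price. All function names, the 6N FLOPs-per-token rule of thumb, and the numeric inputs are assumptions for illustration, not the paper's actual model.

```python
# Illustrative cost-model sketch (not the paper's exact formulation).
# Throughput: tokens/s ~ MFU * peak_FLOPS / FLOPs_per_token,
# with FLOPs_per_token ~ 6 * N for a forward+backward pass.
# For sparse MoE models, N should be the *active* parameter count per token.

def estimate_throughput_tokens_per_s(active_params: float,
                                     gpu_peak_flops: float,
                                     mfu: float = 0.35) -> float:
    """Approximate fine-tuning throughput in tokens per second."""
    flops_per_token = 6.0 * active_params
    return mfu * gpu_peak_flops / flops_per_token


def estimate_finetuning_cost(active_params: float,
                             num_tokens: float,
                             gpu_peak_flops: float,
                             gpu_price_per_hour: float,
                             mfu: float = 0.35) -> float:
    """Cloud cost (USD) = training hours * hourly GPU price."""
    tput = estimate_throughput_tokens_per_s(active_params, gpu_peak_flops, mfu)
    hours = num_tokens / tput / 3600.0
    return hours * gpu_price_per_hour


if __name__ == "__main__":
    # Assumed example: 7B active parameters, 100M fine-tuning tokens,
    # ~312 TFLOPS peak (A100-class GPU), $2.50/hour rental price.
    cost = estimate_finetuning_cost(7e9, 1e8, 312e12, 2.50)
    print(f"Estimated fine-tuning cost: ${cost:.2f}")
```

Plugging in different GPU peak FLOPS, prices, and utilization factors shows how the estimate scales with hardware choice, which is the kind of budgeting question the paper's model targets.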
Abstract: Computation and data reuse are critical for resource-limited Convolutional Neural Network (CNN) accelerators. This paper presents CoDR, a CNN accelerator that applies Universal Computation Reuse to exploit weight sparsity, repetition, and similarity simultaneously within a convolutional layer. Moreover, CoDR decreases the cost of weight memory accesses with a customized Run-Length Encoding (RLE) scheme and reduces the number of accesses to intermediate results by introducing an input- and output-stationary dataflow. Compared to two recent compressed CNN accelerators with the same area of 2.85 mm^2, CoDR requires 5.08x and 7.99x fewer SRAM accesses and consumes 3.76x and 6.84x less energy.
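The abstract mentions a customized Run-Length Encoding scheme for weights but does not describe its format. The sketch below is a generic, minimal RLE over a row of quantized weights, included only to illustrate how runs of zeros and repeated values shrink the encoded stream; it is not CoDR's actual encoding.

```python
# Generic run-length encoding sketch for a row of quantized weights.
# CoDR's customized RLE format is not specified in the abstract; this only
# shows how sparsity and repetition reduce the number of stored entries.

from typing import List, Tuple


def rle_encode(weights: List[int]) -> List[Tuple[int, int]]:
    """Encode a weight row as (value, run_length) pairs."""
    encoded: List[Tuple[int, int]] = []
    for w in weights:
        if encoded and encoded[-1][0] == w:
            encoded[-1] = (w, encoded[-1][1] + 1)
        else:
            encoded.append((w, 1))
    return encoded


def rle_decode(encoded: List[Tuple[int, int]]) -> List[int]:
    """Reconstruct the original weight row from (value, run_length) pairs."""
    return [w for w, run in encoded for _ in range(run)]


if __name__ == "__main__":
    # A sparse, repetitive weight row compresses from 15 values to 5 pairs.
    row = [0, 0, 0, 3, 3, 3, 3, 0, 0, -2, -2, 0, 0, 0, 0]
    enc = rle_encode(row)
    assert rle_decode(enc) == row
    print(f"{len(row)} weights -> {len(enc)} (value, run) pairs: {enc}")
```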