Tristan Webb

Understanding the difficulty of low-precision post-training quantization of large language models

Oct 18, 2024

Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang

Abstract: Large language models with high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization, which minimizes local, layer-wise quantization errors, or through quantization-aware fine-tuning, which minimizes the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error, and the importance of direct quantization-aware fine-tuning, in the regime of large models at very low precision.
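To make the notion of a "local, layer-wise quantization error" concrete, here is a minimal sketch in plain Python. It is not the authors' method: it assumes simple round-to-nearest symmetric quantization with one scale per weight row, and every name in it (`quantize_symmetric`, `local_error`, the random toy layer) is hypothetical. It only measures how far a single linear layer's outputs drift once its weights are quantized, which is the kind of local objective the abstract contrasts with the global loss.

```python
import random

def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric uniform quantization (illustrative; needs bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per row (an assumption)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def layer_output(weight_rows, x):
    """Output of a bias-free linear layer: one dot product per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

def local_error(weight_rows, inputs, bits):
    """Mean squared difference between full-precision and quantized layer outputs."""
    quant_rows = [quantize_symmetric(row, bits) for row in weight_rows]
    total, count = 0.0, 0
    for x in inputs:
        y = layer_output(weight_rows, x)
        yq = layer_output(quant_rows, x)
        total += sum((a - b) ** 2 for a, b in zip(y, yq))
        count += len(y)
    return total / count

# Toy layer and calibration inputs (random, purely for illustration).
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
X = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(32)]
errors = {bits: local_error(W, X, bits) for bits in (8, 4, 2)}
```

The local error grows as the bit width shrinks. The abstract's point is that driving this per-layer quantity down does not necessarily reduce the model's global loss, so the sketch shows only one side of the comparison.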

Scaling laws for post-training quantized large language models

Oct 15, 2024

Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang

Abstract: Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.

A Hardware-Aware System for Accelerating Deep Neural Network Optimization

Feb 25, 2022

Anthony Sarah, Daniel Cummings, Sharath Nittur Sridhar, Sairam Sundaresan, Maciej Szankin, Tristan Webb, J. Pablo Munoz

Abstract: Recent advances in Neural Architecture Search (NAS) that extract specialized hardware-aware configurations (a.k.a. "sub-networks") from a hardware-agnostic "super-network" have become increasingly popular. While considerable effort has been devoted to improving the first stage, namely, the training of the super-network, the search for derivative high-performing sub-networks is still largely under-explored. For example, some recent network morphism techniques allow a super-network to be trained once and then have hardware-specific networks extracted from it as needed. These methods decouple the super-network training from the sub-network search and thus decrease the computational burden of specializing to different hardware platforms. We propose a comprehensive system that automatically and efficiently finds sub-networks from a pre-trained super-network that are optimized for different performance metrics and hardware configurations. By combining novel search tactics and algorithms with intelligent use of predictors, we significantly decrease the time needed to find optimal sub-networks from a given super-network. Further, our approach does not require the super-network to be refined for the target task a priori, thus allowing it to interface with any super-network. We demonstrate through extensive experiments that our system works seamlessly with existing state-of-the-art super-network training methods in multiple domains. Moreover, we show how novel search tactics paired with evolutionary algorithms can accelerate the search process for ResNet50, MobileNetV3 and Transformer while maintaining objective space Pareto front diversity, and demonstrate an 8x faster search than the state-of-the-art Bayesian optimization WeakNAS approach.
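The abstract refers to the Pareto front over the objective space. The sketch below illustrates that concept only and is not the system described in the paper. It assumes two objectives per candidate sub-network, latency (lower is better) and accuracy (higher is better). The function name `pareto_front` and the candidate values are hypothetical.

```python
def pareto_front(candidates):
    """Return the (latency, accuracy) pairs that no other candidate dominates.

    A candidate is dominated when another one is at least as good on both
    objectives and strictly better on at least one.
    """
    front = []
    for i, (lat_i, acc_i) in enumerate(candidates):
        dominated = any(
            lat_j <= lat_i and acc_j >= acc_i and (lat_j < lat_i or acc_j > acc_i)
            for j, (lat_j, acc_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((lat_i, acc_i))
    return sorted(front)

# Hypothetical sub-network measurements: (latency in ms, top-1 accuracy).
subnets = [(10.0, 0.70), (12.0, 0.69), (15.0, 0.75), (20.0, 0.74), (25.0, 0.78)]
front = pareto_front(subnets)
```

Here `(12.0, 0.69)` and `(20.0, 0.74)` drop out because a faster and more accurate candidate exists for each. A search procedure such as the evolutionary one mentioned in the abstract would evaluate new candidates and update this set repeatedly. The quadratic scan used here is only suitable for small candidate pools.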
