Title: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Jun 01, 2023

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han

Abstract: Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for low-bit, weight-only quantization of LLMs. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations rather than the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability across different domains and modalities without overfitting to the calibration set; it also does not rely on any data layout reordering, which maintains hardware efficiency. AWQ outperforms existing work on various language modeling, common sense QA, and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. We also implement efficient tensor core kernels with reorder-free online dequantization to accelerate AWQ, achieving a 1.45x speedup over GPTQ and running 1.85x faster than the cuBLAS FP16 implementation. Our method provides a turn-key solution for compressing LLMs to 3 or 4 bits for efficient deployment.

* Code available at: https://github.com/mit-han-lab/llm-awq
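
The per-channel scaling idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation (see the repository linked above for that). It is a toy in pure Python under several assumptions: the function names are hypothetical, each output row is treated as one quantization group, the scale for input channel j is parameterized as the mean absolute activation raised to a power `alpha`, and `alpha` is chosen by a simple grid search that minimizes output error on a calibration batch. No gradients or reconstruction steps are involved.

```python
import random

def fake_quantize(row, n_bits=4):
    """Round-to-nearest uniform quantization of one group, returned dequantized."""
    lo, hi = min(row), max(row)
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant rows
    return [round((w - lo) / step) * step + lo for w in row]

def linear(X, W):
    """y[n][o] = sum_j X[n][j] * W[o][j] (W is out_features x in_features)."""
    return [[sum(x * w for x, w in zip(xr, wr)) for wr in W] for xr in X]

def output_mse(X, W_ref, W_hat):
    """Mean squared error between the outputs of the two weight matrices."""
    Y, Yh = linear(X, W_ref), linear(X, W_hat)
    n = len(Y) * len(Y[0])
    return sum((a - b) ** 2 for ra, rb in zip(Y, Yh) for a, b in zip(ra, rb)) / n

def quantize_with_scales(W, scales, n_bits=4):
    """Scale input channels up, quantize, then fold the inverse scale back in."""
    out = []
    for row in W:
        scaled = [w * s for w, s in zip(row, scales)]
        q = fake_quantize(scaled, n_bits)
        out.append([v / s for v, s in zip(q, scales)])
    return out

def search_scales(X, W, n_bits=4, n_grid=20):
    """Grid-search alpha in [0, 1]; scales come from activations, not weights."""
    n_in = len(W[0])
    act_mag = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    best = None
    for i in range(n_grid + 1):
        alpha = i / n_grid  # alpha = 0 means no scaling (plain rounding)
        scales = [max(m, 1e-8) ** alpha for m in act_mag]
        err = output_mse(X, W, quantize_with_scales(W, scales, n_bits))
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best

# Toy calibration data (assumed): channel 0 carries much larger activations,
# so the weights attached to it act as the "salient" ones.
random.seed(0)
n_in, n_out, n_samples = 16, 8, 64
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
X = [[random.gauss(0, 1) * (20.0 if j == 0 else 1.0) for j in range(n_in)]
     for _ in range(n_samples)]

baseline_err = output_mse(X, W, quantize_with_scales(W, [1.0] * n_in))
best_err, best_alpha, best_scales = search_scales(X, W)
print(f"plain rounding MSE: {baseline_err:.4f}")
print(f"scaled MSE:         {best_err:.4f} (alpha = {best_alpha:.2f})")
```

Because the grid includes `alpha = 0`, which reproduces plain round-to-nearest quantization, the searched result in this sketch can never be worse than the unscaled baseline on the calibration batch. How much it improves depends on the data.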
