Title: I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

May 28, 2024

Xing Hu, Yuan Chen, Dawei Yang, Sifan Zhou, Zhihang Yuan, Jiangyong Yu, Chen Xu

Abstract: Post-training quantization (PTQ) is a potent technique for accelerating the inference of large language models (LLMs). Nonetheless, existing works still require a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization steps as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only, fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights; (2) to alleviate degradation caused by inter-token variations, we introduce Dynamic Integer-only MatMul (DI-MatMul), which enables dynamic quantization in full-integer matrix multiplication by dynamically quantizing the inputs and outputs with integer-only operations; and (3) we design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which use bit shifts to execute non-linear operators efficiently while maintaining accuracy. Experiments show that I-LLM achieves accuracy comparable to the FP baseline and outperforms non-integer quantization methods. For example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. We have published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.
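
To make the idea of requantizing a matrix product with integer-only operations more concrete, here is a minimal sketch in plain Python. It is not the paper's DI-MatMul: the function names `int_matmul` and `requantize_row`, the scale representation `m / 2**k`, and the restriction to power-of-two rescaling are all assumptions made for illustration. Each output row (one token) gets its own scale, chosen at run time from that row's peak value using only comparisons, additions, and bit shifts.

```python
def int_matmul(a, b):
    """Multiply two integer matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def requantize_row(acc_row, m_in, k_in, bits=8):
    """Shrink one row of integer accumulators to signed `bits`-bit values.

    Hypothetical helper: the row's real values are acc * m_in / 2**k_in.
    The result (q, m_out, k_out) represents approximately the same values
    as q * m_out / 2**k_out. No floating-point operation is used.
    """
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in acc_row)

    # Smallest right shift that brings the row's peak into range.
    shift = 0
    while (peak >> shift) > qmax:
        shift += 1
    if shift == 0:
        return list(acc_row), m_in, k_in

    # Round to nearest by adding half of the discarded range before
    # shifting, then clamp in case rounding pushed the peak past qmax.
    half = 1 << (shift - 1)
    q = [max(-qmax, min(qmax, (v + half) >> shift)) for v in acc_row]

    # Dropping `shift` low bits is absorbed by the scale exponent.
    return q, m_in, k_in - shift


acc = int_matmul([[1, 2], [3, 4]], [[100, 0], [0, 100]])
for row in acc:
    print(requantize_row(row, 3, 8))
# ([50, 100], 3, 7)
# ([75, 100], 3, 6)
```

The two rows end up with different exponents (7 and 6), which is the per-token dynamic behaviour the abstract alludes to. A finer-grained scheme would also adjust the integer multiplier rather than only the exponent; that is left out here to keep the sketch short.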
