Tingxuan Zhong

Towards End-to-end 4-Bit Inference on Generative Large Language Models

Oct 13, 2023

Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

Abstract: We show that the majority of inference computations for large generative models such as LLaMA and OPT can be performed with both weights and activations cast to 4 bits, in a way that leads to practical speedups while maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4 bits while keeping some outlier weights and activations in higher precision. Crucially, our scheme is designed with computational efficiency in mind: we provide GPU kernels with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.1x relative to FP16 execution. Code and models are provided at https://github.com/IST-DASLab/QUIK.
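
To make the hybrid idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the QUIK algorithm or its GPU kernels: the symmetric per-row rounding, the rule for picking outlier columns (largest absolute entry), and every function name below are assumptions made purely for illustration. It only shows why holding a few large-magnitude columns in higher precision can shrink the 4-bit rounding error on the remaining values.

```python
# Illustrative sketch only; all names and the outlier rule are hypothetical,
# not taken from the QUIK code base.

def quantize_int4(values):
    """Symmetric 4-bit quantization: integer codes in [-7, 7] plus one scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0] * len(values), 1.0
    scale = max_abs / 7.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def pick_outlier_columns(matrix, num_outliers):
    """Assumed rule: columns whose largest absolute entry is biggest."""
    num_cols = len(matrix[0])
    col_max = [max(abs(row[j]) for row in matrix) for j in range(num_cols)]
    ranked = sorted(range(num_cols), key=lambda j: col_max[j], reverse=True)
    return sorted(ranked[:num_outliers])


def hybrid_quantize(matrix, num_outliers):
    """Quantize each row to 4 bits; outlier columns are kept as floats."""
    outliers = pick_outlier_columns(matrix, num_outliers)
    outlier_set = set(outliers)
    rows = []
    for row in matrix:
        base = [v for j, v in enumerate(row) if j not in outlier_set]
        codes, scale = quantize_int4(base)
        kept = {j: row[j] for j in outliers}  # higher-precision part
        rows.append((codes, scale, kept))
    return outliers, rows


def reconstruct(outliers, rows, num_cols):
    """Rebuild a float matrix from 4-bit codes plus the kept outlier values."""
    outlier_set = set(outliers)
    result = []
    for codes, scale, kept in rows:
        approx = iter(c * scale for c in codes)
        result.append(
            [kept[j] if j in outlier_set else next(approx) for j in range(num_cols)]
        )
    return result


def max_error(a, b):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy weights: column 2 holds values far larger than the rest.
weights = [[0.1, -0.2, 8.0, 0.3],
           [0.05, 0.15, -6.0, -0.25]]

plain = reconstruct(*hybrid_quantize(weights, 0), num_cols=4)
hybrid = reconstruct(*hybrid_quantize(weights, 1), num_cols=4)

plain_error = max_error(weights, plain)    # everything forced into 4 bits
hybrid_error = max_error(weights, hybrid)  # one outlier column kept as float
```

In this toy case the large column dominates each row's scale, so quantizing everything to 4 bits rounds the small entries to zero; setting that one column aside lets the remaining entries use the full 4-bit range. How the actual method selects outliers and executes the mixed-precision computation on GPU is described in the paper and repository, not here.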

* 9 pages
