Title: Pushing the Limits of Large Language Model Quantization via the Linearity Theorem

Nov 26, 2024

Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh

Abstract: Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches such as the extremely popular NF4 quantized format, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels which match a given compression constraint in the medium-bitwidth regime, obtained by reduction to dynamic programming. On the practical side, we demonstrate improved accuracy-compression trade-offs on Llama-3.1 and 3.2-family models, as well as on Qwen-family models. Further, we show that our method can be efficiently supported in terms of GPU kernels at various batch sizes, advancing both data-free and non-uniform quantization for LLMs.
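The data-free idea named in the abstract (a Hadamard rotation followed by rounding to an MSE-optimal grid) can be illustrated with a small NumPy sketch. This is not the authors' HIGGS implementation: the scalar (one-dimensional) grid fitted with Lloyd's algorithm, the group size of 64, the 16 levels, and the helper names `hadamard`, `lloyd_grid`, and `quantize` are all assumptions chosen for illustration. The final line computes the relative squared $\ell_2$ reconstruction error, the per-layer quantity the abstract relates to the perplexity increase.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_grid(samples: np.ndarray, levels: int, iters: int = 50) -> np.ndarray:
    """1-D Lloyd (k-means) iteration: an approximately MSE-optimal scalar grid."""
    # Start from equal-probability quantiles, then alternate assign / re-center.
    grid = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - grid[None, :]).argmin(axis=1)
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                grid[k] = members.mean()
    return np.sort(grid)

def quantize(w: np.ndarray, grid: np.ndarray, group: int = 64) -> np.ndarray:
    """Rotate each group, normalize it, round to the nearest grid point, then undo both."""
    H = hadamard(group)
    g = w.reshape(-1, group)
    rot = g @ H  # the rotation makes entries within a group closer to Gaussian
    scale = np.sqrt((rot ** 2).mean(axis=1, keepdims=True)) + 1e-12
    idx = np.abs((rot / scale)[..., None] - grid).argmin(axis=-1)
    deq = grid[idx] * scale
    return (deq @ H.T).reshape(w.shape)  # H is orthonormal, so H.T inverts it

rng = np.random.default_rng(0)
# Data-free: the grid is fitted to synthetic standard-normal samples, not to model data.
grid = lloyd_grid(rng.standard_normal(20_000), levels=16)
# Hypothetical heavy-tailed "weights" standing in for a real layer.
w = rng.laplace(size=(256, 256))
w_hat = quantize(w, grid)
rel_err = np.sum((w - w_hat) ** 2) / np.sum(w ** 2)  # relative squared l2 error
```

Because the rotation is orthonormal, the error measured in the rotated space equals the error on the original weights, which is why a grid tuned for Gaussian inputs can be reused across groups without calibration data.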
