Title: ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Dec 18, 2024

Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang

Abstract: Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging because of the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. Using principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method, SpinQuant, and a 2.4x speedup over the 16-bit baseline. Code is available at https://github.com/utkarsh-dmx/project-resq.

* 14 pages, 6 figures, 6 tables
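
The scheme the abstract describes (PCA to find a high-variance subspace of 1/8 of the hidden dimension, 8-bit precision inside it, 4-bit outside it, and a random rotation within each subspace) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the symmetric per-token quantizer, the use of an uncentered second-moment matrix for PCA, and the QR-based random rotation are all assumptions made for illustration, and the sketch covers activations only, not weights or the KV cache.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-row (per-token) uniform fake quantization (assumed quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # fix column signs; q stays orthogonal

def build_projection(calib, frac=1 / 8, seed=0):
    """Split the hidden dimension into low- and high-variance orthonormal bases.

    calib: (num_tokens, hidden_dim) calibration activations (hypothetical input).
    Returns (u_low, u_high); u_high spans the top `frac` of principal directions.
    """
    rng = np.random.default_rng(seed)
    d = calib.shape[1]
    r = max(1, int(d * frac))
    # Uncentered second-moment matrix; an assumption of this sketch.
    cov = calib.T @ calib / calib.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    # Rotate within each subspace so the rotation never mixes the two groups.
    u_low = eigvecs[:, : d - r] @ random_orthogonal(d - r, rng)
    u_high = eigvecs[:, d - r :] @ random_orthogonal(r, rng)
    return u_low, u_high

def mixed_precision_quantize(x, u_low, u_high, low_bits=4, high_bits=8):
    """Project, quantize each group at its own bit width, and project back."""
    low = quantize(x @ u_low, low_bits)
    high = quantize(x @ u_high, high_bits)
    return low @ u_low.T + high @ u_high.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.standard_normal((512, 64))
    acts[:, :4] *= 50.0  # synthetic outlier channels
    u_low, u_high = build_projection(acts)
    err_mixed = np.linalg.norm(acts - mixed_precision_quantize(acts, u_low, u_high))
    err_naive = np.linalg.norm(acts - quantize(acts, 4))
    print(f"mixed-precision error: {err_mixed:.2f}, plain 4-bit error: {err_naive:.2f}")
```

Because the concatenated basis `[u_low, u_high]` is orthogonal, projecting and un-projecting without quantization leaves the activations unchanged, so any difference from the input comes from the quantizers alone. On the synthetic data above, where a few channels carry large outliers, the sketch should show a smaller reconstruction error than plain 4-bit quantization of the raw activations.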
