Title: ReALLM: A general framework for LLM compression and fine-tuning

May 21, 2024

Louis Leconte, Lisa Bedin, Van Minh Nguyen, Eric Moulines

Abstract: We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most post-training quantization and fine-tuning methods for a budget of fewer than 4 bits. Each pre-trained matrix is decomposed into a high-precision low-rank component and a vector-quantized latent representation (obtained with an autoencoder). During fine-tuning, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns, so ReALLM adapts the shape of the encoder (small/large embedding, high/low-bit VQ, etc.) to each matrix. ReALLM represents each matrix with a small embedding on $b$ bits and a neural decoder model $\mathcal{D}_\phi$ with its weights on $b_\phi$ bits. Decompressing a matrix requires only one embedding and a single forward pass through the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of $3$ bits without any training. With a budget of $2$ bits, ReALLM achieves state-of-the-art performance after fine-tuning on a small calibration dataset.
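
The decomposition described in the abstract (a high-precision low-rank term plus a vector-quantized remainder) can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the learned autoencoder and decoder $\mathcal{D}_\phi$ are replaced here by a plain k-means codebook, which is an assumption made purely to keep the example short. All function names (`low_rank_part`, `vector_quantize`, `decompress`, `bits_per_weight`) and the 16-bit storage assumption are hypothetical.

```python
import numpy as np

def low_rank_part(W, rank):
    """Best rank-`rank` approximation of W (truncated SVD), returned as factors A @ B."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (m, rank)
    B = Vt[:rank, :]             # shape (rank, n)
    return A, B

def vector_quantize(R, dim, bits, iters=10, seed=0):
    """k-means VQ over length-`dim` blocks of R.

    Stand-in for the learned decoder in the abstract (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    blocks = R.reshape(-1, dim)
    k = 2 ** bits
    # Initialise the codebook from k distinct blocks of the residual.
    codebook = blocks[rng.choice(len(blocks), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        codes = dist.argmin(1)
        for j in range(k):
            members = blocks[codes == j]
            if len(members):
                codebook[j] = members.mean(0)
    # Final assignment against the updated codebook.
    dist = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1), codebook

def decompress(codes, codebook, shape, A, B):
    """Rebuild the matrix: codebook lookup for the residual plus the low-rank term."""
    return codebook[codes].reshape(shape) + A @ B

def bits_per_weight(m, n, rank, dim, bits, fp_bits=16):
    """Average storage per weight, assuming codebook and factors are kept in fp_bits."""
    code_bits = bits / dim
    overhead = ((2 ** bits) * dim + rank * (m + n)) * fp_bits / (m * n)
    return code_bits + overhead

# Toy usage on a random matrix (illustrative only, not a real LLM layer).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
A, B = low_rank_part(W, rank=4)
codes, codebook = vector_quantize(W - A @ B, dim=4, bits=8)
W_hat = decompress(codes, codebook, W.shape, A, B)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

In a fine-tuning setting that follows the abstract, `codes` and `codebook` would stay frozen and only the factors `A` and `B` would receive gradient updates. On a matrix this small the codebook and low-rank overhead dominate the bit count; the overhead term in `bits_per_weight` shrinks as the matrix dimensions grow.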
