Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jae-joon Kim

L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ

Feb 15, 2024

Hyesung Jeon, Yulhwa Kim, Jae-joon Kim

Figure 1 for L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ

Figure 2 for L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ

Figure 3 for L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ

Figure 4 for L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ

Abstract:Post-training quantization (PTQ) and quantization-aware training (QAT) methods are gaining popularity in mitigating the high memory and computational costs associated with Large Language Models (LLMs). In resource-constrained scenarios, PTQ, with its reduced training overhead, is often preferred over QAT, despite the latter's potential for higher accuracy. Meanwhile, parameter-efficient fine-tuning (PEFT) methods like low-rank adaptation (LoRA) have been introduced, and recent efforts have explored quantization-aware PEFT techniques. However, these approaches may lack generality due to their reliance on the pre-quantized model's configuration. Their effectiveness may be compromised by non-linearly quantized or mixed-precision weights, and the retraining of specific quantization parameters might impede optimal performance. To address these challenges, we propose L4Q, an algorithm for parameter-efficient quantization-aware training. L4Q leverages LoRA-wise learned quantization step size for LLMs, aiming to enhance generality. The simultaneous quantization-and-fine-tuning process of L4Q is applicable to high-precision models, yielding linearly quantized weights with superior accuracy. Our experiments, conducted on the LLaMA and LLaMA2 model families using an instructional dataset, showcase L4Q's capabilities in language comprehension and few-shot in-context learning, achieving sub-4-bit precision while maintaining comparable training times to applying PEFT on a quantized model.

* 8 pages, 2 figures

Via

Access Paper or Ask Questions