Title: Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

Feb 24, 2024

Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You

Abstract: While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, require only forward passes during training, making them more memory-friendly. However, the quality of gradient estimates in zeroth-order optimization often depends on the data dimensionality, potentially explaining why MeZO still exhibits significant performance drops compared to standard fine-tuning across various tasks. Inspired by the success of Parameter-Efficient Fine-Tuning (PEFT), this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order (ZO) optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet effective parameter selection scheme that yields significant performance gains with Sparse-MeZO. Additionally, we develop a memory-optimized implementation for sparse masking, ensuring the algorithm requires only inference-level memory consumption, allowing Sparse-MeZO to fine-tune LLaMA-30b on a single A100 GPU. Experimental results illustrate that Sparse-MeZO consistently improves both performance and convergence speed over MeZO without any overhead. For example, it achieves a 9% absolute accuracy improvement and 3.5x speedup over MeZO on the RTE task.
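
To make the core idea concrete, here is a minimal NumPy sketch of a masked zeroth-order update: a two-point (SPSA-style) gradient estimate built from forward passes only, with the random perturbation restricted to a chosen subset of parameters. This is an illustration under stated assumptions, not the authors' implementation. The function names (`sparse_zo_step`, `quadratic_loss`, `run_demo`), the toy quadratic loss, the hyperparameters, and the fixed hand-picked mask are all hypothetical; the abstract does not describe the paper's parameter selection scheme or its memory-optimized masking, so neither is reproduced here.

```python
import numpy as np


def sparse_zo_step(theta, loss_fn, mask, lr=1e-2, eps=1e-3, seed=0):
    """One masked zeroth-order update using two forward passes and no backprop."""
    # Draw the perturbation from a seed so it could be regenerated rather
    # than stored; zero it outside the mask so only selected parameters move.
    z = np.random.default_rng(seed).standard_normal(theta.shape) * mask
    loss_plus = loss_fn(theta + eps * z)
    loss_minus = loss_fn(theta - eps * z)
    # Central finite difference: estimates the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    return theta - lr * projected_grad * z


def quadratic_loss(theta):
    """Toy stand-in for a model's loss (assumption: minimum at all-ones)."""
    return 0.5 * float(np.sum((theta - 1.0) ** 2))


def run_demo(steps=200):
    theta = np.zeros(8)
    # Hypothetical fixed mask: tune the first four parameters, freeze the rest.
    mask = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
    for step in range(steps):
        theta = sparse_zo_step(theta, quadratic_loss, mask, seed=step)
    return theta, mask


if __name__ == "__main__":
    theta, mask = run_demo()
    print("final parameters:", theta)
    print("final loss:", quadratic_loss(theta))
```

In this sketch the frozen entries never change, because the perturbation is zero wherever the mask is zero, while the selected entries drift toward the minimizer. Restricting the perturbation shrinks the effective dimensionality of the gradient estimate, which is the motivation the abstract gives for sparsity.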
