Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction

Dec 17, 2024

Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds

Figure 1 for SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction

Figure 2 for SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction

Figure 3 for SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction

Figure 4 for SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction

Share this with someone who'll enjoy it:

Abstract:Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they maintain additional moving average states throughout training, which results in memory requirements several times greater than the model. This overhead imposes constraints on scalability and computational efficiency. On the other hand, while stochastic gradient descent (SGD) is optimal in terms of memory efficiency, their capability in LLM training is limited (Zhao et al., 2024b). To address this dilemma, we show that pre-processing SGD is sufficient to reach Adam-level performance on LLMs. Specifically, we propose to preprocess the instantaneous stochastic gradients with two simple operators: $\mathtt{GradNorm}$ and $\mathtt{GradWhitening}$. $\mathtt{GradNorm}$ stabilizes gradient distributions, and $\mathtt{GradWhitening}$ counteracts the local curvature of the loss landscape, respectively. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any accumulative state variables. Empirically, SWAN has the same memory footprint as SGD, achieving $\approx 50\%$ reduction on total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates the same or even a substantial improvement over Adam. Specifically, when pre-training the LLaMa model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity in less than half tokens seen.

View paper on

Share this with someone who'll enjoy it:

Title:SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction

Paper and Code