Title: MADA: Meta-Adaptive Optimizers through hyper-gradient Descent

Jan 17, 2024

Kaan Ozkara, Can Karakus, Parameswaran Raman, Mingyi Hong, Shoham Sabach, Branislav Kveton, Volkan Cevher

Abstract: Since Adam was introduced, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and search through it using hyper-gradient descent. Numerical results suggest that MADA is robust against sub-optimally tuned hyper-parameters, and outperforms Adam, Lion, and Adan with their default hyper-parameters, often even with optimized hyper-parameters. We also propose AVGrad, a variant of AMSGrad where the maximum operator is replaced with averaging, and observe that it performs better within MADA. Finally, we provide a convergence analysis to show that interpolation of optimizers (specifically, AVGrad and Adam) can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
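
To make the AMSGrad-versus-averaging distinction in the abstract concrete, here is a minimal scalar sketch. It is not the authors' implementation. The running-mean form of the averaging, the interpolation coefficient `rho`, the omission of bias correction, and all function names are assumptions made for illustration. In MADA such a coefficient would be learned by hyper-gradient descent; here it is a fixed argument.

```python
import math


def second_moment_update(v, v_hat, grad, t, beta2=0.999, mode="avg"):
    """Update the second-moment state for one scalar parameter.

    v     : exponential moving average of squared gradients (as in Adam)
    v_hat : the quantity used in the denominator of the update
    t     : 1-indexed step counter
    mode  : "adam" (use v directly), "max" (AMSGrad), or "avg"
            (assumed form: running mean of v_1..v_t in place of the max)
    """
    v = beta2 * v + (1.0 - beta2) * grad * grad
    if mode == "adam":
        v_hat = v
    elif mode == "max":
        v_hat = max(v_hat, v)
    elif mode == "avg":
        v_hat = v_hat + (v - v_hat) / t  # incremental running mean
    else:
        raise ValueError(f"unknown mode: {mode}")
    return v, v_hat


def interpolated_step(x, m, v, v_avg, grad, t, rho,
                      lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step whose denominator interpolates Adam and the averaged variant.

    rho = 1 uses Adam's second moment, rho = 0 uses the averaged one.
    Bias correction is omitted to keep the sketch short.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v, v_avg = second_moment_update(v, v_avg, grad, t, beta2, mode="avg")
    denom_sq = rho * v + (1.0 - rho) * v_avg
    x = x - lr * m / (math.sqrt(denom_sq) + eps)
    return x, m, v, v_avg
```

With `beta2=0.5`, a gradient of 2 followed by a gradient of 0 leaves the `"max"` state at 2.0 (it never decreases), while the `"avg"` state moves to 1.5, which is the behavioral difference the averaging is meant to introduce.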
