Abstract:Meta-learning that uses implicit gradient have provided an exciting alternative to standard techniques which depend on the trajectory of the inner loop training. Implicit meta-learning (IML), however, require computing $2^{nd}$ order gradients, particularly the Hessian which is impractical to compute for modern deep learning models. Various approximations for the Hessian were proposed but a systematic comparison of their compute cost, stability, generalization of solution found and estimation accuracy were largely overlooked. In this study, we start by conducting a systematic comparative analysis of the various approximation methods and their effect when incorporated into IML training routines. We establish situations where catastrophic forgetting is exhibited in IML and explain their cause in terms of the inability of the approximations to estimate the curvature at convergence points. Sources of IML training instability are demonstrated and remedied. A detailed analysis of the effeciency of various inverse Hessian-vector product approximation methods is also provided. Subsequently, we use the insights gained to propose and evaluate a novel semi-supervised learning algorithm that learns to inductively weigh consistency regularization losses. We show how training a "Confidence Network" to extract domain specific features can learn to up-weigh useful images and down-weigh out-of-distribution samples. Results outperform the baseline FixMatch performance.
Abstract:We analyze VeLO (versatile learned optimizer), the largest scale attempt to train a general purpose "foundational" optimizer to date. VeLO was trained on thousands of machine learning tasks using over 4000 TPU months with the goal of producing an optimizer capable of generalizing to new problems while being hyperparameter free, and outperforming industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite. We find that, contrary to initial claims: (1) VeLO has a critical hyperparameter that needs problem-specific tuning, (2) VeLO does not necessarily outperform competitors in quality of solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO's generality and the value of the investment in training it.