Alexander Hägele

The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

Jan 31, 2025

Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach

Abstract: We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with the optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
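
For readers unfamiliar with the schedule named in the abstract, the following is a minimal sketch of a constant learning rate followed by a linear cooldown. The function name, the `cooldown_frac` parameter, its default of 0.2, and the choice to decay toward zero at the end of training are illustrative assumptions, not details taken from the paper.

```python
def constant_with_linear_cooldown(step, total_steps, base_lr, cooldown_frac=0.2):
    """Return the learning rate for a 0-indexed training step.

    Hypothetical helper: the rate stays at base_lr, then decays linearly
    over the last cooldown_frac of training so it would reach 0 at total_steps.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if not 0.0 <= cooldown_frac <= 1.0:
        raise ValueError("cooldown_frac must lie in [0, 1]")

    # Number of steps spent in the cooldown phase (assumed rounding rule).
    cooldown_steps = round(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps

    if step < cooldown_start:
        # Constant phase.
        return base_lr

    # Linear cooldown: scale by the fraction of cooldown steps remaining.
    remaining = total_steps - step
    return base_lr * remaining / cooldown_steps
```

For example, with `total_steps=10` and `cooldown_frac=0.2`, steps 0 through 8 use the full `base_lr` and step 9 uses half of it.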

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

May 29, 2024

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi

Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup and future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative - constant learning rate and cooldowns - and find that it scales predictably and reliably, similarly to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs. Our code is available at https://github.com/epfml/schedules-and-scaling.
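
The weight averaging mentioned in the abstract can be illustrated with a generic uniform running average over parameter snapshots. This is only a sketch under stated assumptions: the abstract does not specify the averaging window or weighting, and the function name and the flat-list representation of weights are hypothetical.

```python
def update_running_average(avg, weights, n_averaged):
    """Fold a new weight snapshot into a uniform running average.

    avg:        current average of the snapshots seen so far (list of floats)
    weights:    new snapshot to include (list of floats, same length)
    n_averaged: number of snapshots already included in avg
    """
    if len(avg) != len(weights):
        raise ValueError("avg and weights must have the same length")
    # Incremental mean: new_avg = avg + (w - avg) / (n + 1)
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, weights)]
```

Because the average is maintained incrementally, it requires no extra training steps, only one additional copy of the parameters.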

BaCaDI: Bayesian Causal Discovery with Unknown Interventions

Jun 03, 2022

Alexander Hägele, Jonas Rothfuss, Lars Lorch, Vignesh Ram Somnath, Bernhard Schölkopf, Andreas Krause

Abstract: Learning causal structures from observation and experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, a key challenge is that often the targets of the interventions are uncertain or unknown. Thus, standard causal discovery methods can no longer be used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering the causal structure that underlies data generated under various unknown experimental/interventional conditions. BaCaDI is fully differentiable and operates in the continuous space of latent probabilistic representations of both causal structures and interventions. This enables us to approximate complex posteriors via gradient-based variational inference and to reason about the epistemic uncertainty in the predicted structure. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets. Finally, we demonstrate that, thanks to its rigorous Bayesian approach, our method provides well-calibrated uncertainty estimates.
