Cameron J. Hogan

On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

Jan 21, 2026

Sharan Sahu, Cameron J. Hogan, Martin T. Wells

Abstract: While momentum-based acceleration has been studied extensively in deterministic optimization problems, its behavior in nonstationary environments -- where the data distribution and optimal parameters drift over time -- remains underexplored. We analyze the tracking performance of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak heavy-ball and Nesterov) under uniform strong convexity and smoothness in varying stepsize regimes. We derive finite-time bounds in expectation and with high probability for the tracking error, establishing a sharp decomposition into three components: a transient initialization term, a noise-induced variance term, and a drift-induced tracking lag. Crucially, our analysis uncovers a fundamental trade-off: while momentum can suppress gradient noise, it incurs an explicit penalty on the tracking capability. We show that momentum can substantially amplify drift-induced tracking error, with amplification that becomes unbounded as the momentum parameter approaches one, formalizing the intuition that using 'stale' gradients hinders adaptation to rapid regime shifts. Complementing these upper bounds, we establish minimax lower bounds for dynamic regret under gradient-variation constraints. These lower bounds prove that the inertia-induced penalty is not an artifact of analysis but an information-theoretic barrier: in drift-dominated regimes, momentum creates an unavoidable 'inertia window' that fundamentally degrades performance. Collectively, these results provide a definitive theoretical grounding for the empirical instability of momentum in dynamic environments and delineate the precise regime boundaries where SGD provably outperforms its accelerated counterparts.

* 70 pages, 4 figures, 2 tables

Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican

Sep 13, 2022

Nathaniel R. Robinson, Cameron J. Hogan, Nancy Fulda, David R. Mortensen

Abstract: Multilingual transfer techniques often improve low-resource machine translation (MT). Many of these techniques are applied without considering data characteristics. We show in the context of Haitian-to-English translation that transfer effectiveness is correlated with amount of training data and relationships between knowledge-sharing languages. Our experiments suggest that for some languages beyond a threshold of authentic data, back-translation augmentation methods are counterproductive, while cross-lingual transfer from a sufficiently related language is preferred. We complement this finding by contributing a rule-based French-Haitian orthographic and syntactic engine and a novel method for phonological embedding. When used with multilingual techniques, orthographic transformation makes statistically significant improvements over conventional methods. And in very low-resource Jamaican MT, code-switching with a transfer language for orthographic resemblance yields a 6.63 BLEU point advantage.
