Shane Bergsma

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs

Feb 21, 2025
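
The first entry names a concrete technique: decay the learning rate linearly from its peak all the way down to zero over the training run. As a rough illustration only (a minimal sketch, not the paper's implementation; the function name and the optional warmup handling are assumptions), such a schedule can be written as:

# Sketch of a linear-to-zero learning rate schedule, as named in the title.
# The warmup phase is a common companion and an assumption here, not a
# detail taken from the paper.
def linear_decay_to_zero(step: int, total_steps: int, peak_lr: float,
                         warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup to `peak_lr`, then linear decay to 0."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup budget remaining; hits exactly 0 at the final step.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)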

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers

Nov 01, 2024

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

May 24, 2024

C2FAR: Coarse-to-Fine Autoregressive Networks for Precise Probabilistic Forecasting

Dec 22, 2023

SutraNets: Sub-series Autoregressive Networks for Long-Sequence, Probabilistic Forecasting

Dec 22, 2023