Joel Hestness

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs

Feb 21, 2025

Crystal: Illuminating LLM Abilities on Language and Code

Nov 06, 2024

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers

Nov 01, 2024

Bilingual Adaptation of Monolingual Foundation Models

Jul 13, 2024

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

May 24, 2024

MediSwift: Efficient Sparse Pre-trained Biomedical Language Models

Mar 01, 2024

Position Interpolation Improves ALiBi Extrapolation

Oct 18, 2023

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

Sep 20, 2023

SlimPajama-DC: Understanding Data Combinations for LLM Training

Sep 19, 2023

Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

Apr 06, 2023