Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per token remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information-theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings, we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
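For reference, the classical Freedman inequality illustrates the variance-dependent structure the abstract alludes to; the sketch below is the standard textbook form for martingale difference sequences, not the paper's fully empirical variant, and the symbols ($X_i$, $V_n$, $b$, $\sigma^2$) are notation introduced here for illustration only.

```latex
% Classical Freedman inequality (Freedman, 1975) -- schematic reference only.
% Let $X_1,\dots,X_n$ be a martingale difference sequence with $X_i \le b$ almost surely,
% and let $V_n = \sum_{i=1}^{n} \mathbb{E}\!\left[X_i^2 \mid \mathcal{F}_{i-1}\right]$
% denote the predictable quadratic variation. Then for all $t, \sigma^2 > 0$,
\[
  \Pr\!\left( \sum_{i=1}^{n} X_i \ge t \ \text{ and } \ V_n \le \sigma^2 \right)
  \;\le\; \exp\!\left( - \frac{t^2}{2\left(\sigma^2 + bt/3\right)} \right).
\]
% Because the exponent depends on the conditional variance $\sigma^2$ rather than only a
% worst-case range, a small loss variance yields a correspondingly tighter tail bound,
% which is the mechanism by which variance enters the generalization gap.
```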