Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

David J. Schwab

Data coarse graining can improve model performance

Sep 18, 2025

Alex Nguyen, David J. Schwab, Vudtiwat Ngampruetikorn

Abstract:Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under 'data coarse graining.' Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results reveal a nonmonotonic dependence of the prediction risk on the degree of coarse graining. A 'high-pass' scheme--which filters out less relevant, lower-signal features--can help models generalize better. By contrast, a 'low-pass' scheme that integrates out more relevant, higher-signal features is purely detrimental. Crucially, using optimal regularization, we demonstrate that this nonmonotonicity is a distinct effect of data coarse graining and not an artifact of double descent. Our framework offers a clear, analytical explanation for why careful data augmentation works: it strips away less relevant degrees of freedom and isolates more predictive signals. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.

* 7 pages, 4 figures

Via

Access Paper or Ask Questions

When can in-context learning generalize out of task distribution?

Jun 05, 2025

Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab

Abstract:In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emph{out-of-distribution}. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.

Via

Access Paper or Ask Questions

Generalized Information Bottleneck for Gaussian Variables

Mar 31, 2023

Vudtiwat Ngampruetikorn, David J. Schwab

Figure 1 for Generalized Information Bottleneck for Gaussian Variables

Figure 2 for Generalized Information Bottleneck for Gaussian Variables

Abstract:The information bottleneck (IB) method offers an attractive framework for understanding representation learning, however its applications are often limited by its computational intractability. Analytical characterization of the IB method is not only of practical interest, but it can also lead to new insights into learning phenomena. Here we consider a generalized IB problem, in which the mutual information in the original IB method is replaced by correlation measures based on Renyi and Jeffreys divergences. We derive an exact analytical IB solution for the case of Gaussian correlated variables. Our analysis reveals a series of structural transitions, similar to those previously observed in the original IB case. We find further that although solving the original, Renyi and Jeffreys IB problems yields different representations in general, the structural transitions occur at the same critical tradeoff parameters, and the Renyi and Jeffreys IB solutions perform well under the original IB objective. Our results suggest that formulating the IB method with alternative correlation measures could offer a strategy for obtaining an approximate solution to the original IB problem.

* 7 pages, 3 figures

Via

Access Paper or Ask Questions

Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Aug 08, 2022

Vudtiwat Ngampruetikorn, David J. Schwab

Figure 1 for Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Figure 2 for Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Figure 3 for Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Figure 4 for Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Abstract:Avoiding overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. This puzzling contradiction necessitates new approaches to the study of overfitting. Here we quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data. Information efficient learning algorithms minimize residual information while maximizing the relevant bits, which are predictive of the unknown generative models. We solve this optimization to obtain the information content of optimal algorithms for a linear regression problem and compare it to that of randomized ridge regression. Our results demonstrate the fundamental tradeoff between residual and relevant information and characterize the relative information efficiency of randomized regression with respect to optimal algorithms. Finally, using results from random matrix theory, we reveal the information complexity of learning a linear map in high dimensions and unveil information-theoretic analogs of double and multiple descent phenomena.

* 11 pages, 4 figures

Via

Access Paper or Ask Questions

Perturbation Theory for the Information Bottleneck

May 28, 2021

Vudtiwat Ngampruetikorn, David J. Schwab

Figure 1 for Perturbation Theory for the Information Bottleneck

Abstract:Extracting relevant information from data is crucial for all forms of learning. The information bottleneck (IB) method formalizes this, offering a mathematically precise and conceptually appealing framework for understanding learning phenomena. However the nonlinearity of the IB problem makes it computationally expensive and analytically intractable in general. Here we derive a perturbation theory for the IB method and report the first complete characterization of the learning onset, the limit of maximum relevant information per bit extracted from data. We test our results on synthetic probability distributions, finding good agreement with the exact numerical solution near the onset of learning. We explore the difference and subtleties in our derivation and previous attempts at deriving a perturbation theory for the learning onset and attribute the discrepancy to a flawed assumption. Our work also provides a fresh perspective on the intimate relationship between the IB method and the strong data processing inequality.

* 8 pages, 3 figures

Via

Access Paper or Ask Questions

Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning

Mar 23, 2021

Chaitanya K. Ryali, David J. Schwab, Ari S. Morcos

Figure 1 for Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning

Figure 2 for Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning

Figure 3 for Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning

Figure 4 for Leveraging background augmentations to encourage semantic focus in self-supervised contrastive learning

Abstract:Unsupervised representation learning is an important challenge in computer vision, with self-supervised learning methods recently closing the gap to supervised representation learning. An important ingredient in high-performing self-supervised methods is the use of data augmentation by training models to place different augmented views of the same image nearby in embedding space. However, commonly used augmentation pipelines treat images holistically, disregarding the semantic relevance of parts of an image-e.g. a subject vs. a background-which can lead to the learning of spurious correlations. Our work addresses this problem by investigating a class of simple, yet highly effective "background augmentations", which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds. Background augmentations lead to substantial improvements (+1-2% on ImageNet-1k) in performance across a spectrum of state-of-the art self-supervised methods (MoCov2, BYOL, SwAV) on a variety of tasks, allowing us to reach within 0.3% of supervised performance. We also demonstrate that background augmentations improve robustness to a number of out of distribution settings, including natural adversarial examples, the backgrounds challenge, adversarial attacks, and ReaL ImageNet.

* Technical Report

Via

Access Paper or Ask Questions

Are all negatives created equal in contrastive instance discrimination?

Oct 25, 2020

Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, Ari S. Morcos

Figure 1 for Are all negatives created equal in contrastive instance discrimination?

Figure 2 for Are all negatives created equal in contrastive instance discrimination?

Figure 3 for Are all negatives created equal in contrastive instance discrimination?

Figure 4 for Are all negatives created equal in contrastive instance discrimination?

Abstract:Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is then used on downstream tasks such as image classification. Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found a minority of negatives -- the hardest 5% -- were both necessary and sufficient for the downstream task to reach nearly full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, the very hardest 0.1% of negatives were unnecessary and sometimes detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. Together, our results indicate that negatives vary in importance and that CID may benefit from more intelligent negative treatment.

* Fixed author name error

Via

Access Paper or Ask Questions

Learning Optimal Representations with the Decodable Information Bottleneck

Sep 27, 2020

Yann Dubois, Douwe Kiela, David J. Schwab, Ramakrishna Vedantam

Figure 1 for Learning Optimal Representations with the Decodable Information Bottleneck

Figure 2 for Learning Optimal Representations with the Decodable Information Bottleneck

Figure 3 for Learning Optimal Representations with the Decodable Information Bottleneck

Figure 4 for Learning Optimal Representations with the Decodable Information Bottleneck

Abstract:We address the question of characterizing and finding optimal representations for supervised learning. Traditionally, this question has been tackled using the Information Bottleneck, which compresses the inputs while retaining information about the targets, in a decoder-agnostic fashion. In machine learning, however, our goal is not compression but rather generalization, which is intimately linked to the predictive family or decoder of interest (e.g. linear classifier). We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family. As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees. Empirically, we show that the framework can be used to enforce a small generalization gap on downstream classifiers and to predict the generalization ability of neural networks.

* Accepted at NeurIPS 2020

Via

Access Paper or Ask Questions

Theory of gating in recurrent neural networks

Aug 29, 2020

Kamesh Krishnamurthy, Tankut Can, David J. Schwab

Figure 1 for Theory of gating in recurrent neural networks

Figure 2 for Theory of gating in recurrent neural networks

Figure 3 for Theory of gating in recurrent neural networks

Figure 4 for Theory of gating in recurrent neural networks

Abstract:RNNs are popular dynamical models, used for processing sequential data. Prior theoretical work in understanding the properties of RNNs has focused on models with additive interactions, where the input to a unit is a weighted sum of the output of the remaining units in network. However, there is ample evidence that neurons can have gating - i.e. multiplicative - interactions. Such gating interactions have significant effects on the collective dynamics of the network. Furthermore, the best performing RNNs in machine learning have gating interactions. Thus, gating interactions are beneficial for information processing and learning tasks. We develop a dynamical mean-field theory (DMFT) of gating to understand the dynamical regimes produced by gating. Our gated RNN reduces to the classical RNNs in certain limits and is closely related to popular gated models in machine learning. We use random matrix theory (RMT) to analytically characterize the spectrum of the Jacobian and show how gating produces slow modes and marginal stability. Thus, gating is a potential mechanism to implement computations involving line attractor dynamics. The long-time behavior of the gated network is studied using its Lyapunov spectrum, and the DMFT is used to provide an analytical prediction for the maximum Lyapunov exponent. We also show that gating gives rise to a novel, discontinuous transition to chaos, where the proliferation of critical points is decoupled with the appearance of chaotic dynamics; the nature of this chaotic state is characterized in detail. Using the DMFT and RMT, we produce phase diagrams for gated RNN. Finally, we address the gradients by leveraging the adjoint sensitivity framework to develop a DMFT for the gradients. The theory developed here sheds light on the rich dynamical behaviour produced by gating interactions and has implications for architectural choices and learning dynamics.

Via

Access Paper or Ask Questions

Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Feb 29, 2020

Jonathan Frankle, David J. Schwab, Ari S. Morcos

Figure 1 for Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Figure 2 for Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Figure 3 for Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Figure 4 for Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Abstract:Batch normalization (BatchNorm) has become an indispensable tool for training deep neural networks, yet it is still poorly understood. Although previous work has typically focused on its normalization component, BatchNorm also adds two per-feature trainable parameters: a coefficient and a bias. However, the role and expressive power of these parameters remains unclear. To study this question, we investigate the performance achieved when training only these parameters and freezing all others at their random initializations. We find that doing so leads to surprisingly high performance. For example, a sufficiently deep ResNet reaches 83% accuracy on CIFAR-10 in this configuration. Interestingly, BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features without any changes to the training objective. Not only do these results highlight the under-appreciated role of the affine parameters in BatchNorm, but - in a broader sense - they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.

Via

Access Paper or Ask Questions