Joan Bruna

CIMS

The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models

Jun 05, 2025

Alex Damian, Jason D. Lee, Joan Bruna

Abstract:In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the \emph{generative leap} exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of $n=\Theta(d^{1 \vee k^\star/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with $r$-dimensional first hidden layer).
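To make the data model concrete, the following is a minimal sketch of sampling from a Gaussian multi-index model. The function name `sample_multi_index`, the particular link function, and the sizes are illustrative assumptions, not taken from the paper. The final check shows the defining property: moving an input within the orthogonal complement of the hidden subspace leaves the label unchanged.

```python
import numpy as np

def sample_multi_index(n, d, r, link, rng):
    """Draw n samples (x, y) with x ~ N(0, I_d) and y = link(U^T x)."""
    # Random orthonormal basis U (d x r) of the hidden subspace, via QR.
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    X = rng.standard_normal((n, d))
    y = link(X @ U)  # labels see the inputs only through the projection
    return X, y, U

rng = np.random.default_rng(0)
# Hypothetical link on r = 2 coordinates: a ReLU gate times a linear term.
link = lambda Z: np.maximum(Z[:, 0], 0.0) * Z[:, 1]
X, y, U = sample_multi_index(n=1000, d=50, r=2, link=link, rng=rng)

# Perturb the inputs only in directions orthogonal to the hidden subspace.
noise = rng.standard_normal(X.shape)
noise_perp = noise - (noise @ U) @ U.T
y_perturbed = link((X + noise_perp) @ U)
labels_unchanged = np.allclose(y, y_perturbed)
```

The estimation problem described in the abstract is the reverse direction: recover the span of `U` from `(X, y)` alone, without knowing `link`.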

Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time

Apr 17, 2025

Margalit Glasgow, Denny Wu, Joan Bruna

Abstract:We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle's velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain ``self-concordance'' property in these problems -- where the local Hessian of a particle is bounded by a constant times the particle's velocity -- polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.

* 70 pages
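The information exponent mentioned in the abstract is usually defined as the index of the first non-zero Hermite coefficient (of degree at least 1) of the link function under the standard Gaussian. The sketch below estimates it numerically with Gauss-Hermite quadrature. The function names, tolerance, and truncation degree are assumptions chosen for illustration, and the result is only as reliable as the quadrature and the tolerance allow.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficients(link, max_degree=8, quad_points=40):
    """Coefficients E[link(z) He_k(z)] / sqrt(k!) for z ~ N(0, 1)."""
    x, w = hermegauss(quad_points)      # nodes/weights for exp(-x^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)        # rescale so the weights sum to 1
    coeffs = []
    for k in range(max_degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                  # selects the probabilists' He_k
        he_k = hermeval(x, basis)
        coeffs.append(np.sum(w * link(x) * he_k) / math.sqrt(math.factorial(k)))
    return np.array(coeffs)

def information_exponent(link, tol=1e-8, **kwargs):
    """Smallest k >= 1 with a non-negligible Hermite coefficient, or None."""
    coeffs = hermite_coefficients(link, **kwargs)
    for k in range(1, len(coeffs)):
        if abs(coeffs[k]) > tol:
            return k
    return None

exp_tanh = information_exponent(np.tanh)                   # odd link
exp_square = information_exponent(lambda z: z**2)          # even link
exp_he3 = information_exponent(lambda z: z**3 - 3.0 * z)   # equals He_3
```

A link equal to the degree-3 Hermite polynomial has no degree-1 or degree-2 component, so gradient-based methods start with a very weak signal; this is the regime where the abstract's polynomial-in-$d$ convergence times arise.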

Survey on Algorithms for multi-index models

Apr 07, 2025

Joan Bruna, Daniel Hsu

Abstract:We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexity. In many cases, a gap is observed between the sample complexity of the best known computationally efficient methods and the information-theoretical minimum. We also review algorithms based on estimating the span of gradients using nonparametric methods, and algorithms based on fitting neural networks using gradient descent.

Thermalizer: Stable autoregressive neural emulation of spatiotemporal chaos

Mar 24, 2025

Chris Pedersen, Laure Zanna, Joan Bruna

Abstract:Autoregressive surrogate models (or \textit{emulators}) of spatiotemporal systems provide an avenue for fast, approximate predictions, with broad applications across science and engineering. At inference time, however, these models are generally unable to provide predictions over long time rollouts due to accumulation of errors leading to diverging trajectories. In essence, emulators operate out of distribution, and controlling the online distribution quickly becomes intractable in large-scale settings. To address this fundamental issue, and focusing on time-stationary systems admitting an invariant measure, we leverage diffusion models to obtain an implicit estimator of the score of this invariant measure. We show that this model of the score function can be used to stabilize autoregressive emulator rollouts by applying on-the-fly denoising during inference, a process we call \textit{thermalization}. Thermalizing an emulator rollout is shown to extend the time horizon of stable predictions by an order of magnitude in complex systems exhibiting turbulent and chaotic behavior, opening up a novel application of diffusion models in the context of neural emulation.

Compositional Reasoning with Transformers, RNNs, and Chain of Thought

Mar 03, 2025

Gilad Yehudai, Noah Amsel, Joan Bruna

Abstract:We study and compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of problems we term Compositional Reasoning Questions (CRQ). This family captures problems like evaluating Boolean formulas and multi-step word problems. Assuming standard hardness assumptions from circuit complexity and communication complexity, we prove that none of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We also provide a construction for each architecture that solves CRQs. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. (Otherwise, a linear dimension is necessary.) For transformers with chain of thought, our construction uses $n$ CoT tokens. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.
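The abstract names Boolean formula evaluation as one instance of a compositional problem. The sketch below shows only that instance as a recursive tree evaluation; it is not the paper's formal CRQ definition, and the `Leaf` and `Node` classes and the example formula are hypothetical. The point is that each node's answer depends on the answers of its children, so the computation is inherently nested.

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    value: bool

@dataclass
class Node:
    op: str                                   # "and", "or", or "not"
    children: list = field(default_factory=list)

def evaluate(tree):
    """Evaluate a Boolean formula bottom-up."""
    if isinstance(tree, Leaf):
        return tree.value
    values = [evaluate(child) for child in tree.children]
    if tree.op == "and":
        return all(values)
    if tree.op == "or":
        return any(values)
    if tree.op == "not":
        (value,) = values                     # "not" takes exactly one child
        return not value
    raise ValueError(f"unknown operator: {tree.op}")

def depth(tree):
    """Number of operator levels between the root and the deepest leaf."""
    if isinstance(tree, Leaf):
        return 0
    return 1 + max(depth(child) for child in tree.children)

# (True AND (False OR True)) AND (NOT False)
formula = Node("and", [
    Node("and", [Leaf(True), Node("or", [Leaf(False), Leaf(True)])]),
    Node("not", [Leaf(False)]),
])
result = evaluate(formula)
levels = depth(formula)
```

The recursion depth here grows with the formula, which mirrors the abstract's claim that some resource (depth, embedding dimension, or chain-of-thought length) must grow with the input.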

Geometry and Optimization of Shallow Polynomial Networks

Jan 10, 2025

Yossi Arjevani, Joan Bruna, Joe Kileel, Elzbieta Polak, Matthew Trager

Abstract:We study shallow neural networks with polynomial activations. The function space for these models can be identified with a set of symmetric tensors with bounded rank. We describe general features of these networks, focusing on the relationship between width and optimization. We then consider teacher-student problems, that can be viewed as a problem of low-rank tensor approximation with respect to a non-standard inner product that is induced by the data distribution. In this setting, we introduce a teacher-metric discriminant which encodes the qualitative behavior of the optimization as a function of the training data distribution. Finally, we focus on networks with quadratic activations, presenting an in-depth analysis of the optimization landscape. In particular, we present a variation of the Eckart-Young Theorem characterizing all critical points and their Hessian signatures for teacher-student problems with quadratic networks and Gaussian training data.

* 36 pages, 2 figures
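For the quadratic case, the identification with symmetric tensors of bounded rank can be checked directly: a width-$m$ network $x \mapsto \sum_i a_i (w_i^\top x)^2$ equals the quadratic form $x^\top M x$ with $M = \sum_i a_i w_i w_i^\top$, a symmetric matrix of rank at most $m$. The sketch below verifies this numerically; the sizes and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 6, 3
W = rng.standard_normal((width, d))   # hidden-layer weights, one row per neuron
a = rng.standard_normal(width)        # output weights

def quadratic_network(x):
    """Shallow network with activation t -> t^2."""
    return np.sum(a * (W @ x) ** 2)

# Symmetric matrix representing the same function: M = sum_i a_i w_i w_i^T.
M = W.T @ (a[:, None] * W)

x = rng.standard_normal(d)
same_function = np.isclose(quadratic_network(x), x @ M @ x)
is_symmetric = np.allclose(M, M.T)
rank_M = np.linalg.matrix_rank(M)
```

Different weight settings `(a, W)` can produce the same `M`, which is why the analysis in the abstract is phrased in terms of the function space rather than the parameters.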

On the Benefits of Rank in Attention Layers

Jul 23, 2024

Noah Amsel, Gilad Yehudai, Joan Bruna

Abstract:Attention-based mechanisms are widely used in machine learning, most prominently in transformers. However, hyperparameters such as the rank of the attention matrices and the number of heads are scaled nearly the same way in all realizations of this architecture, without theoretical justification. In this work we show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism. Specifically, we present a simple and natural target function that can be represented using a single full-rank attention head for any context length, but that cannot be approximated by low-rank attention unless the number of heads is exponential in the embedding dimension, even for short context lengths. Moreover, we prove that, for short context lengths, adding depth allows the target to be approximated by low-rank attention. For long contexts, we conjecture that full-rank attention is necessary. Finally, we present experiments with off-the-shelf transformers that validate our theoretical findings.
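The "rank" in question is the rank of the bilinear form $W_Q W_K^\top$ that produces the attention scores: with query and key projections of shape $d \times r$, that $d \times d$ matrix has rank at most $r$. The following sketch illustrates this constraint for a single head. The dimensions, variable names, and the $1/\sqrt{r}$ scaling are conventional assumptions rather than details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 16, 2, 5                     # embedding dim, head rank, context length
W_Q = rng.standard_normal((d, r))      # query projection
W_K = rng.standard_normal((d, r))      # key projection
X = rng.standard_normal((n, d))        # token embeddings

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Attention scores are x_i^T (W_Q W_K^T) x_j, a bilinear form of rank <= r.
A = W_Q @ W_K.T
scores = X @ A @ X.T / np.sqrt(r)
attn = softmax(scores)

rank_A = np.linalg.matrix_rank(A)
rows_sum_to_one = np.allclose(attn.sum(axis=1), 1.0)
```

A full-rank head corresponds to `r = d`; the abstract's separation result concerns targets that such a head represents but that heads with small `r` cannot approximate without exponentially many heads.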

Posterior Sampling with Denoising Oracles via Tilted Transport

Jun 30, 2024

Joan Bruna, Jiequn Han

Abstract:Score-based diffusion models have significantly advanced high-dimensional data generation across various domains, by learning a denoising oracle (or score) from datasets. From a Bayesian perspective, they offer a realistic modeling of data priors and facilitate solving inverse problems through posterior sampling. Although many heuristic methods have been developed recently for this purpose, they lack the quantitative guarantees needed in many scientific applications. In this work, we introduce the \textit{tilted transport} technique, which leverages the quadratic structure of the log-likelihood in linear inverse problems in combination with the prior denoising oracle to transform the original posterior sampling problem into a new `boosted' posterior that is provably easier to sample from. We quantify the conditions under which this boosted posterior is strongly log-concave, highlighting the dependencies on the condition number of the measurement matrix and the signal-to-noise ratio. The resulting posterior sampling scheme is shown to reach the computational threshold predicted for sampling Ising models [Kunisky'23] with a direct analysis, and is further validated on high-dimensional Gaussian mixture models and scalar field $\varphi^4$ models.

How Truncating Weights Improves Reasoning in Language Models

Jun 05, 2024

Lei Chen, Joan Bruna, Alberto Bietti

Abstract:In addition to the ability to generate fluent text in various languages, large language models have been successful at tasks that involve basic forms of logical "reasoning" over their context. Recent work found that selectively removing certain components from weight matrices in pre-trained models can improve such reasoning capabilities. We investigate this phenomenon further by carefully studying how certain global associations tend to be stored in specific weight components or Transformer blocks, in particular feed-forward layers. Such associations may hurt predictions in reasoning tasks, and removing the corresponding components may then improve performance. We analyze how this arises during training, both empirically and theoretically, on a two-layer Transformer trained on a basic reasoning task with noise, a toy associative memory model, and on the Pythia family of pre-trained models tested on simple reasoning tasks.

Computational-Statistical Gaps in Gaussian Single-Index Models

Mar 12, 2024

Alex Damian, Loucas Pillaud-Vivien, Jason D. Lee, Joan Bruna

Abstract:Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation. As such, they encompass a broad class of statistical inference tasks, and provide a rich template to study statistical and computational trade-offs in the high-dimensional regime. While the information-theoretic sample complexity to recover the hidden direction is linear in the dimension $d$, we show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples, where $k^\star$ is a "generative" exponent associated with the model that we explicitly characterize. Moreover, we show that this sample complexity is also sufficient, by establishing matching upper bounds using a partial-trace algorithm. Therefore, our results provide evidence of a sharp computational-to-statistical gap (under both the SQ and LDP class) whenever $k^\star>2$. To complete the study, we provide examples of smooth and Lipschitz deterministic target functions with arbitrarily large generative exponents $k^\star$.

* 61 pages
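As a rough illustration of recovering the hidden direction in a single-index model, the sketch below uses a standard second-moment spectral estimator on the even link $y = (w^\top x)^2$. This is not the partial-trace algorithm from the paper; it is a simpler, well-known method swapped in for illustration, and the sample sizes and names are assumptions. For this link, $\mathbb{E}[y\,(x x^\top - I)] = 2 w w^\top$, so the top eigenvector of the empirical version estimates $w$ up to sign.

```python
import numpy as np

def spectral_estimate(X, y):
    """Top eigenvector of (1/n) sum_i y_i (x_i x_i^T - I)."""
    n, d = X.shape
    M = (X.T * y) @ X / n - y.mean() * np.eye(d)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvecs[:, -1]

rng = np.random.default_rng(0)
n, d = 20000, 20
w = rng.standard_normal(d)
w /= np.linalg.norm(w)                     # hidden unit direction
X = rng.standard_normal((n, d))            # Gaussian inputs
y = (X @ w) ** 2                           # even link: no linear signal

w_hat = spectral_estimate(X, y)
overlap = abs(w_hat @ w)                   # sign of w is not identifiable
```

The first-order statistic $\mathbb{E}[y\,x]$ vanishes for this link, which is why a second-order method is needed; links with larger exponents require correspondingly higher-order statistics and more samples, as the abstract's $\Omega(d^{k^\star/2})$ bound indicates.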
