Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bruno Mlodozeniec

Distributional Training Data Attribution

Jun 15, 2025

Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse

Abstract:Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming through introducing distributional training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. We demonstrate the practical significance of d-TDA in experiments, e.g. by identifying training examples that drastically change the distribution of some target measurement without necessarily changing the mean. Intriguingly, we also find that influence functions (IFs), a popular but poorly-understood data attribution tool, emerge naturally from our distributional framework as the limit to unrolled differentiation; without requiring restrictive convexity assumptions. This provides a new mathematical motivation for their efficacy in deep learning, and helps to characterise their limitations.

Via

Access Paper or Ask Questions

Influence Functions for Scalable Data Attribution in Diffusion Models

Oct 17, 2024

Bruno Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, Richard Turner

Abstract:Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an \textit{influence functions} framework. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We recast previously proposed methods as specific design choices in our framework and show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.

Via

Access Paper or Ask Questions

Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

May 28, 2024

Jihao Andreas Lin, Shreyas Padhy, Bruno Mlodozeniec, Javier Antorán, José Miguel Hernández-Lobato

Figure 1 for Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

Figure 2 for Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

Figure 3 for Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

Figure 4 for Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

Abstract:Scaling hyperparameter optimisation to very large datasets remains an open problem in the Gaussian process community. This paper focuses on iterative methods, which use linear system solvers, like conjugate gradients, alternating projections or stochastic gradient descent, to construct an estimate of the marginal likelihood gradient. We discuss three key improvements which are applicable across solvers: (i) a pathwise gradient estimator, which reduces the required number of solver iterations and amortises the computational cost of making predictions, (ii) warm starting linear system solvers with the solution from the previous step, which leads to faster solver convergence at the cost of negligible bias, (iii) early stopping linear system solvers after a limited computational budget, which synergises with warm starting, allowing solver progress to accumulate over multiple marginal likelihood steps. These techniques provide speed-ups of up to $72\times$ when solving to tolerance, and decrease the average residual norm by up to $7\times$ when stopping early.

* arXiv admin note: text overlap with arXiv:2405.18328

Via

Access Paper or Ask Questions

Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

May 28, 2024

Jihao Andreas Lin, Shreyas Padhy, Bruno Mlodozeniec, José Miguel Hernández-Lobato

Figure 1 for Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

Figure 2 for Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

Figure 3 for Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

Figure 4 for Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

Abstract:Gaussian processes are a versatile probabilistic machine learning model whose effectiveness often depends on good hyperparameters, which are typically learned by maximising the marginal likelihood. In this work, we consider iterative methods, which use iterative linear system solvers to approximate marginal likelihood gradients up to a specified numerical precision, allowing a trade-off between compute time and accuracy of a solution. We introduce a three-level hierarchy of marginal likelihood optimisation for iterative Gaussian processes, and identify that the computational costs are dominated by solving sequential batches of large positive-definite systems of linear equations. We then propose to amortise computations by reusing solutions of linear system solvers as initialisations in the next step, providing a $\textit{warm start}$. Finally, we discuss the necessary conditions and quantify the consequences of warm starts and demonstrate their effectiveness on regression tasks, where warm starts achieve the same results as the conventional procedure while providing up to a $16 \times$ average speed-up among datasets.

* Advances in Approximate Bayesian Inference 2024

Via

Access Paper or Ask Questions

Denoising Diffusion Probabilistic Models in Six Simple Steps

Feb 10, 2024

Richard E. Turner, Cristiana-Diana Diaconu, Stratis Markou, Aliaksandra Shysheya, Andrew Y. K. Foong, Bruno Mlodozeniec

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) are a very popular class of deep generative model that have been successfully applied to a diverse range of problems including image and video generation, protein and material synthesis, weather forecasting, and neural surrogates of partial differential equations. Despite their ubiquity it is hard to find an introduction to DDPMs which is simple, comprehensive, clean and clear. The compact explanations necessary in research papers are not able to elucidate all of the different design steps taken to formulate the DDPM and the rationale of the steps that are presented is often omitted to save space. Moreover, the expositions are typically presented from the variational lower bound perspective which is unnecessary and arguably harmful as it obfuscates why the method is working and suggests generalisations that do not perform well in practice. On the other hand, perspectives that take the continuous time-limit are beautiful and general, but they have a high barrier-to-entry as they require background knowledge of stochastic differential equations and probability flow. In this note, we distill down the formulation of the DDPM into six simple steps each of which comes with a clear rationale. We assume that the reader is familiar with fundamental topics in machine learning including basic probabilistic modelling, Gaussian distributions, maximum likelihood estimation, and deep learning.

Via

Access Paper or Ask Questions

Meta- (out-of-context) learning in neural networks

Oct 24, 2023

Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Mlodozeniec, David Krueger

Figure 1 for Meta- (out-of-context) learning in neural networks

Figure 2 for Meta- (out-of-context) learning in neural networks

Figure 3 for Meta- (out-of-context) learning in neural networks

Figure 4 for Meta- (out-of-context) learning in neural networks

Abstract:Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of a phenomenon we call meta-out-of-context learning (meta-OCL) via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily "internalize" the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code can be found at https://github.com/krasheninnikov/internalization.

Via

Access Paper or Ask Questions

Hyperparameter Optimization through Neural Network Partitioning

Apr 28, 2023

Bruno Mlodozeniec, Matthias Reisser, Christos Louizos

Abstract:Well-tuned hyperparameters are crucial for obtaining good generalization behavior in neural networks. They can enforce appropriate inductive biases, regularize the model and improve performance -- especially in the presence of limited data. In this work, we propose a simple and efficient way for optimizing hyperparameters inspired by the marginal likelihood, an optimization objective that requires no validation data. Our method partitions the training data and a neural network model into $K$ data shards and parameter partitions, respectively. Each partition is associated with and optimized only on specific data shards. Combining these partitions into subnetworks allows us to define the ``out-of-training-sample" loss of a subnetwork, i.e., the loss on data shards unseen by the subnetwork, as the objective for hyperparameter optimization. We demonstrate that we can apply this objective to optimize a variety of different hyperparameters in a single training run while being significantly computationally cheaper than alternative methods aiming to optimize the marginal likelihood for neural networks. Lastly, we also focus on optimizing hyperparameters in federated learning, where retraining and cross-validation are particularly challenging.

* Published as a conference paper at ICLR 2023

Via

Access Paper or Ask Questions

Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics

Feb 02, 2023

Leon Klein, Andrew Y. K. Foong, Tor Erlend Fjelde, Bruno Mlodozeniec, Marc Brockschmidt, Sebastian Nowozin, Frank Noé, Ryota Tomioka

Figure 1 for Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics

Figure 2 for Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics

Figure 3 for Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics

Figure 4 for Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics

Abstract:Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where the equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. Our method constitutes an important step towards developing general, transferable algorithms for accelerating MD.

Via

Access Paper or Ask Questions

Ensemble Distribution Distillation

May 30, 2019

Andrey Malinin, Bruno Mlodozeniec, Mark Gales

Figure 1 for Ensemble Distribution Distillation

Figure 2 for Ensemble Distribution Distillation

Figure 3 for Ensemble Distribution Distillation

Figure 4 for Ensemble Distribution Distillation

Abstract:Ensembles of models often yield improvements in system performance. They have also been empirically shown to yield robust measures of uncertainty, and are capable of distinguishing between different forms of uncertainty. However, ensembles come at a computational and memory cost which may be prohibitive for many applications. There has been significant work done on the distillation of an ensemble into a single model. Such approaches decrease computational cost and allow a single model to achieve an accuracy comparable to that of an ensemble. However, information about the diversity of the ensemble, which can yield estimates of different forms of uncertainty, is lost. Recently, a new class of models called Prior Networks has been proposed, which allows a single neural network to explicitly model a distribution over output distributions. In this work, Ensemble Distillation and Prior Networks are combined to yield a novel approach called Ensemble Distribution Distillation (EnD$^2$), in which the distribution of an ensemble is distilled into a single Prior Network. This enables a single model to retain both the improved classification performance as well as measures of diversity of the ensemble. The properties of EnD$^2$ have been investigated on both an artificial dataset, and on the CIFAR-10 dataset, where it is shown that EnD$^2$ can approach the performance of an ensemble, and outperforms both standard DNNs and standard Ensemble Distillation.

Via

Access Paper or Ask Questions