Oscar Key

Approximate Top-$k$ for Increased Parallelism

Dec 05, 2024

Oscar Key, Luka Ribar, Alberto Cattaneo, Luke Hudlass-Galley, Douglas Orr

Abstract: We present an evaluation of bucketed approximate top-$k$ algorithms. Computing top-$k$ exactly suffers from limited parallelism, because the $k$ largest values must be aggregated along the vector, and is thus not well suited to computation on highly parallel machine learning accelerators. By relaxing the requirement that the top-$k$ is exact, bucketed algorithms can dramatically increase the parallelism available by independently computing many smaller top-$k$ operations. We explore the design choices of this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-$k$ to select the most important parameters or activations. We also release a fast bucketed top-$k$ implementation for PyTorch.
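
The bucketing idea in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' released PyTorch implementation: the function name `bucketed_top_k`, the use of contiguous equal-sized buckets, and the requirement that `k` and the vector length divide evenly by the bucket count are all assumptions made for illustration.

```python
import heapq


def bucketed_top_k(values, k, num_buckets):
    """Approximate top-k (illustrative sketch, hypothetical helper).

    Splits `values` into `num_buckets` contiguous, equal-sized buckets and
    takes the k / num_buckets largest entries from each bucket. Each bucket
    is processed independently, so the per-bucket work could run in parallel.
    """
    n = len(values)
    if n % num_buckets != 0 or k % num_buckets != 0:
        raise ValueError("len(values) and k must be divisible by num_buckets")
    bucket_size = n // num_buckets
    per_bucket = k // num_buckets

    selected = []
    for start in range(0, n, bucket_size):
        bucket = values[start:start + bucket_size]
        # Small, independent exact top-k within this bucket only.
        selected.extend(heapq.nlargest(per_bucket, bucket))
    return sorted(selected, reverse=True)


values = [9, 8, 1, 2, 7, 3, 4, 5]
exact = bucketed_top_k(values, k=2, num_buckets=1)   # one bucket: exact top-2
approx = bucketed_top_k(values, k=2, num_buckets=2)  # two buckets: approximate
print(exact, approx)
```

With a single bucket the result is the exact top-$k$. With two buckets the second-largest value can be missed when it shares a bucket with the largest, which is the accuracy cost traded for parallelism.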

Scalable Data Assimilation with Message Passing

Apr 19, 2024

Oscar Key, So Takao, Daniel Giles, Marc Peter Deisenroth

Abstract: Data assimilation is a core component of numerical weather prediction systems. The large quantity of data processed during assimilation requires the computation to be distributed across increasingly many compute nodes, yet existing approaches suffer from synchronisation overhead in this setting. In this paper, we exploit the formulation of data assimilation as a Bayesian inference problem and apply a message-passing algorithm to solve the spatial inference problem. Since message passing is inherently based on local computations, this approach lends itself to parallel and distributed computation. In combination with a GPU-accelerated implementation, we can scale the algorithm to very large grid sizes while retaining good accuracy and compute and memory requirements.

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

Jul 26, 2023

Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner

Abstract: The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training. In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine which we call reference system time. We discuss the limitations of our proposed protocol and release our code to encourage rigorous research in efficient training procedures: https://github.com/JeanKaddour/NoTrainNoGain.

Optimally-Weighted Estimators of the Maximum Mean Discrepancy for Likelihood-Free Inference

Jan 30, 2023

Ayush Bharti, Masha Naslidnyk, Oscar Key, Samuel Kaski, François-Xavier Briol

Abstract: Likelihood-free inference methods typically make use of a distance between simulated and real data. A common example is the maximum mean discrepancy (MMD), which has previously been used for approximate Bayesian computation, minimum distance estimation, generalised Bayesian inference, and within the nonparametric learning framework. The MMD is commonly estimated at a root-$m$ rate, where $m$ is the number of simulated samples. This can lead to significant computational challenges since a large $m$ is required to obtain an accurate estimate, which is crucial for parameter estimation. In this paper, we propose a novel estimator for the MMD with significantly improved sample complexity. The estimator is particularly well suited for computationally expensive smooth simulators with low- to mid-dimensional inputs. This claim is supported through both theoretical results and an extensive simulation study on benchmark simulators.
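
For context, the commonly used baseline estimator that the abstract contrasts against can be sketched as follows. This is the standard equally-weighted (V-statistic) estimate of the squared MMD, not the optimally-weighted estimator the paper proposes. The Gaussian kernel, the scalar inputs, and the names `gaussian_kernel` and `mmd_squared` are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel on scalars; the kernel choice is an assumption."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def mmd_squared(simulated, observed, lengthscale=1.0):
    """Equally-weighted V-statistic estimate of the squared MMD.

    Every simulated sample receives the same weight 1/m; this is the
    standard baseline, not the weighted estimator from the paper.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, lengthscale) for u in a for v in b)
        return total / (len(a) * len(b))

    return (
        mean_kernel(simulated, simulated)
        - 2.0 * mean_kernel(simulated, observed)
        + mean_kernel(observed, observed)
    )


same = mmd_squared([0.1, 0.4, 0.9], [0.1, 0.4, 0.9])  # identical samples
far = mmd_squared([0.0, 0.0], [10.0, 10.0])           # well-separated samples
print(same, far)
```

Identical sample sets give an estimate of zero, while well-separated sets give a value close to 2 under this kernel, its largest possible value.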

Towards Healing the Blindness of Score Matching

Sep 15, 2022

Mingtian Zhang, Oscar Key, Peter Hayes, David Barber, Brooks Paige, François-Xavier Briol

Abstract: Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when using these for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate it. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.

Composite Goodness-of-fit Tests with Kernels

Nov 19, 2021

Oscar Key, Tamara Fernandez, Arthur Gretton, François-Xavier Briol

Abstract: Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to the development of a range of inference methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. One set of tools which can help are goodness-of-fit tests, where we test whether a dataset could have been generated by a fixed distribution. Kernel-based tests have been developed for this problem, and these are popular due to their flexibility, strong theoretical guarantees and ease of implementation in a wide range of scenarios. In this paper, we extend this line of work to the more challenging composite goodness-of-fit problem, where we are instead interested in whether the data comes from any distribution in some parametric family. This is equivalent to testing whether a parametric model is well-specified for the data.

Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties

Mar 16, 2021

Lisa Schut, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Medb Corcoran, Yarin Gal

Abstract: Counterfactual explanations (CEs) are a practical tool for demonstrating why machine learning classifiers make particular decisions. For CEs to be useful, it is important that they are easy for users to interpret. Existing methods for generating interpretable CEs rely on auxiliary generative models, which may not be suitable for complex datasets, and incur engineering overhead. We introduce a simple and fast method for generating interpretable CEs in a white-box setting without an auxiliary model, by using the predictive uncertainty of the classifier. Our experiments show that our proposed algorithm generates more interpretable CEs, according to IM1 scores, than existing methods. Additionally, our approach allows us to estimate the uncertainty of a CE, which may be important in safety-critical applications, such as those in the medical domain.

* Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021
* 21 pages, 13 Figures

Improving Deterministic Uncertainty Estimation in Deep Learning for Classification and Regression

Feb 22, 2021

Joost van Amersfoort, Lewis Smith, Andrew Jesson, Oscar Key, Yarin Gal

Abstract: We propose a new model that estimates uncertainty in a single forward pass and works on both classification and regression problems. Our approach combines a bi-Lipschitz feature extractor with an inducing point approximate Gaussian process, offering robust and principled uncertainty estimation. This can be seen as a refinement of Deep Kernel Learning (DKL), with our changes allowing DKL to match the accuracy of softmax neural networks. Our method overcomes the limitations of previous work addressing deterministic uncertainty quantification, such as the dependence of uncertainty on ad hoc hyper-parameters. Our method matches SotA accuracy, 96.2% on CIFAR-10, while maintaining the speed of softmax models, and provides uncertainty estimates that outperform previous single forward pass uncertainty models. Finally, we demonstrate our method on a recently introduced benchmark for uncertainty in regression: treatment deferral in causal models for personalized medicine.

On Signal-to-Noise Ratio Issues in Variational Inference for Deep Gaussian Processes

Nov 01, 2020

Tim G. J. Rudner, Oscar Key, Yarin Gal, Tom Rainforth

Abstract: We show that the gradient estimates used in training Deep Gaussian Processes (DGPs) with importance-weighted variational inference are susceptible to signal-to-noise ratio (SNR) issues. Specifically, we show both theoretically and empirically that the SNR of the gradient estimates for the latent variable's variational parameters decreases as the number of importance samples increases. As a result, these gradient estimates degrade to pure noise if the number of importance samples is too large. To address this pathology, we show how doubly-reparameterized gradient estimators, originally proposed for training variational autoencoders, can be adapted to the DGP setting and that the resultant estimators completely remedy the SNR issue, thereby providing more reliable training. Finally, we demonstrate that our fix can lead to improvements in the predictive performance of the model's predictive posterior.

Interlocking Backpropagation: Improving depthwise model-parallelism

Oct 08, 2020

Aidan N. Gomez, Oscar Key, Stephen Gou, Nick Frosst, Jeff Dean, Yarin Gal

Abstract: The number of parameters in state-of-the-art neural networks has drastically increased in recent years. This surge of interest in large-scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism suffers from poor resource utilisation, which leads to wasted resources. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation, while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image classification ResNets and Transformer language models, finding that our strategy consistently out-performs local learning in terms of task performance, and out-performs global learning in training efficiency.
