Andres Potapczynski

Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra

Jun 24, 2025

Alan N. Amin, Andres Potapczynski, Andrew Gordon Wilson

Abstract: To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Notably, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.

* ICML 2025. Code available at https://github.com/AlanNawzadAmin/DeepWAS

Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Oct 03, 2024

Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Zixi Chen, Micah Goldblum, Bayan Bruss, Christopher De Sa, Andrew Gordon Wilson

Abstract: Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small $\omega$ (which measures parameter sharing) and large $\psi$ (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.

* NeurIPS 2024. Code available at https://github.com/AndPotap/einsum-search
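
To illustrate why a structured matrix can cost less than a dense layer, the following is a minimal pure-Python sketch of a low-rank layer, one of the structures the abstract lists. It is not taken from the paper's code: the helper names (`matvec`, `matmul`, `low_rank_apply`) and the toy sizes are assumptions chosen for illustration only.

```python
def matvec(M, x):
    """Dense matrix-vector product; costs len(M) * len(x) multiplications."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def matmul(A, B):
    """Dense matrix product, used here only to build the reference W = U V."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def low_rank_apply(U, V, x):
    """Apply W = U V to x as U (V x), never forming the d_out x d_in matrix W."""
    return matvec(U, matvec(V, x))


# Toy sizes (assumed for illustration): d_out = 4, d_in = 4, rank r = 1.
U = [[1.0], [2.0], [0.5], [-1.0]]        # d_out x r
V = [[3.0, 0.0, 1.0, 2.0]]               # r x d_in
x = [1.0, -1.0, 2.0, 0.5]

y_structured = low_rank_apply(U, V, x)   # r * (d_in + d_out) = 8 multiplications
y_dense = matvec(matmul(U, V), x)        # d_out * d_in = 16 multiplications
```

Both paths return the same vector, but the structured path uses fewer multiplications and stores fewer parameters. A rank-1 matrix is of course far less expressive than a dense one, which is the trade-off the abstract's taxonomy (parameter sharing and rank) is concerned with.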

Compute Better Spent: Replacing Dense Layers with Structured Matrices

Jun 10, 2024

Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson

Abstract: Dense linear layers are the dominant computational bottleneck in foundation models. Identifying more efficient alternatives to dense matrices has enormous potential for building more compute-efficient models, as exemplified by the success of convolutional networks in the image domain. In this work, we systematically explore structured matrices as replacements for dense matrices. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance, especially as models scale. Using insights from the Maximal Update Parameterization, we determine the optimal scaling for initialization and learning rates of these unconventional layers. Finally, we measure the scaling laws of different structures to compare how quickly their performance improves with compute. We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks. On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models.

* ICML 2024. Code available at https://github.com/shikaiqiu/compute-better-spent

CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra

Sep 06, 2023

Andres Potapczynski, Marc Finzi, Geoff Pleiss, Andrew Gordon Wilson

Abstract: Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.

* Code available at https://github.com/wilson-labs/cola
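
The kind of structure exploitation the abstract describes can be sketched with a Kronecker product. The pure-Python example below does not use CoLA's API; `kron`, `matmul`, and `kron_matvec` are hypothetical helpers written for this illustration. It relies on the standard identity that, with row-major flattening, (A ⊗ B) x equals the flattened matrix A X Bᵀ, where X is x reshaped to have as many rows as A has columns.

```python
def matmul(A, B):
    """Plain dense matrix product on lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def kron(A, B):
    """Materialize the Kronecker product explicitly (reference only)."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def kron_matvec(A, B, x):
    """Compute (A kron B) x as flatten(A X B^T) without forming the product."""
    n, q = len(A[0]), len(B[0])
    X = [x[j * q:(j + 1) * q] for j in range(n)]   # reshape x to n x q, row-major
    Bt = [list(col) for col in zip(*B)]            # transpose of B, q x p
    Y = matmul(matmul(A, X), Bt)                   # m x p result
    return [y for row in Y for y in row]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
x = [1.0, 2.0, 3.0, 4.0]

fast = kron_matvec(A, B, x)
dense = [sum(k * v for k, v in zip(row, x)) for row in kron(A, B)]
```

The structured path works only with the small factors, while the reference path builds a matrix whose size is the product of the factor sizes. A dispatch rule in the spirit of the abstract would select the structured path automatically whenever an operator is known to be a Kronecker product.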

Simple and Fast Group Robustness by Automatic Feature Reweighting

Jun 19, 2023

Shikai Qiu, Andres Potapczynski, Pavel Izmailov, Andrew Gordon Wilson

Abstract: A major challenge to out-of-distribution generalization is reliance on spurious features -- patterns that are predictive of the class label in the training data distribution, but not causally related to the target. Standard methods for reducing the reliance on spurious features typically assume that we know what the spurious feature is, which is rarely true in the real world. Methods that attempt to alleviate this limitation are complex, hard to tune, and lead to a significant computational overhead compared to standard training. In this paper, we propose Automatic Feature Reweighting (AFR), an extremely simple and fast method for updating the model to reduce the reliance on spurious features. AFR retrains the last layer of a standard ERM-trained base model with a weighted loss that emphasizes the examples where the ERM model predicts poorly, automatically upweighting the minority group without group labels. With this simple procedure, we improve upon the best reported results among competing methods trained without spurious attributes on several vision and natural language classification benchmarks, using only a fraction of their compute.

* 40th International Conference on Machine Learning (ICML 2023). Code available at https://github.com/AndPotap/afr
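
The reweighting idea in the abstract can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes weights of the form exp(-gamma * p_i), where p_i is the probability the ERM model assigns to the correct class of example i, and the names `afr_style_weights`, `weighted_nll`, and `gamma` are hypothetical. The exact weighting and any regularization used in the paper may differ.

```python
import math


def afr_style_weights(p_correct, gamma):
    """Normalized weights that grow as the base model's confidence in the
    correct class shrinks (assumed exponential form, for illustration)."""
    raw = [math.exp(-gamma * p) for p in p_correct]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_nll(p_correct_new, weights):
    """Weighted negative log-likelihood that a retrained last layer would minimize."""
    return -sum(w * math.log(p) for w, p in zip(weights, p_correct_new))


# Two well-predicted examples and one poorly predicted example (toy values).
p_erm = [0.95, 0.90, 0.20]
weights = afr_style_weights(p_erm, gamma=4.0)
uniform = afr_style_weights(p_erm, gamma=0.0)
loss = weighted_nll(p_erm, weights)
```

With `gamma = 0` the weights are uniform and the loss reduces to the ordinary average; larger values concentrate the loss on the examples the base model gets wrong, which requires no group labels.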

A Stable and Scalable Method for Solving Initial Value PDEs with Neural Networks

Apr 28, 2023

Marc Finzi, Andres Potapczynski, Matthew Choptuik, Andrew Gordon Wilson

Abstract: Unlike conventional grid and mesh based methods for solving partial differential equations (PDEs), neural networks have the potential to break the curse of dimensionality, providing approximate solutions to problems where using classical solvers is difficult or impossible. While global minimization of the PDE residual over the network parameters works well for boundary value problems, catastrophic forgetting impairs the applicability of this approach to initial value problems (IVPs). In an alternative local-in-time approach, the optimization problem can be converted into an ordinary differential equation (ODE) on the network parameters and the solution propagated forward in time; however, we demonstrate that current methods based on this approach suffer from two key issues. First, following the ODE produces an uncontrolled growth in the conditioning of the problem, ultimately leading to unacceptably large numerical errors. Second, as the ODE methods scale cubically with the number of model parameters, they are restricted to small neural networks, significantly limiting their ability to represent intricate PDE initial conditions and solutions. Building on these insights, we develop Neural IVP, an ODE based IVP solver which prevents the network from getting ill-conditioned and runs in time linear in the number of parameters, enabling us to evolve the dynamics of challenging PDEs with neural networks.

* ICLR 2023. Code available at https://github.com/mfinzi/neural-ivp

PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Nov 24, 2022

Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson

Abstract: While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization.

* NeurIPS 2022. Code is available at https://github.com/activatedgeek/tight-pac-bayes

Low-Precision Arithmetic for Fast Gaussian Processes

Jul 14, 2022

Wesley J. Maddox, Andres Potapczynski, Andrew Gordon Wilson

Abstract: Low-precision arithmetic has had a transformative effect on the training of neural networks, reducing computation, memory and energy requirements. However, despite its promise, low-precision arithmetic has received little attention for Gaussian processes (GPs), largely because GPs require sophisticated linear algebra routines that are unstable in low-precision. We study the different failure modes that can occur when training GPs in half precision. To circumvent these failure modes, we propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, and preconditioning. Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low-precision over a wide range of settings, enabling GPs to train on $1.8$ million data points in $10$ hours on a single GPU, without any sparse approximations.

* UAI 2022. Code available at https://github.com/AndPotap/halfpres_gps
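
For readers unfamiliar with the solver the abstract builds on, here is a minimal sketch of conjugate gradients with a Jacobi (diagonal) preconditioner in pure Python. It runs in ordinary double precision and deliberately omits the re-orthogonalization and mixed-precision components the paper proposes; the function name `pcg`, the choice of preconditioner, and the toy system are assumptions made for illustration.

```python
def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iters=100):
    """Solve A x = b for symmetric positive definite A using
    Jacobi-preconditioned conjugate gradients."""
    inv_diag = [1.0 / A[i][i] for i in range(len(A))]
    x = [0.0] * len(b)
    r = list(b)                                   # residual b - A x at x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for _ in range(max_iters):
        if dot(r, r) ** 0.5 < tol:                # stop once the residual is small
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                   # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
```

Each iteration needs only one matrix-vector product and a few inner products. Those inner products and the running residual update are where rounding error accumulates, which is why stability becomes the central concern once the arithmetic drops to half precision.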

On the Normalizing Constant of the Continuous Categorical Distribution

Apr 28, 2022

Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Andres Potapczynski, John P. Cunningham

Abstract: Probability distributions supported on the simplex enjoy a wide range of applications across statistics and machine learning. Recently, a novel family of such distributions has been discovered: the continuous categorical. This family enjoys remarkable mathematical simplicity; its density function resembles that of the Dirichlet distribution, but with a normalizing constant that can be written in closed form using elementary functions only. In spite of this mathematical simplicity, our understanding of the normalizing constant remains far from complete. In this work, we characterize the numerical behavior of the normalizing constant and we present theoretical and methodological advances that can, in turn, help to enable broader applications of the continuous categorical distribution. Our code is available at https://github.com/cunningham-lab/cb_and_cc/.

Bias-Free Scalable Gaussian Processes via Randomized Truncations

Feb 12, 2021

Andres Potapczynski, Luhuan Wu, Dan Biderman, Geoff Pleiss, John P. Cunningham

Abstract: Scalable Gaussian Process methods are computationally attractive, yet introduce modeling biases that require rigorous study. This paper analyzes two common techniques: early truncated conjugate gradients (CG) and random Fourier features (RFF). We find that both methods introduce a systematic bias on the learned hyperparameters: CG tends to underfit while RFF tends to overfit. We address these issues using randomized truncation estimators that eliminate bias in exchange for increased variance. In the case of RFF, we show that the bias-to-variance conversion is indeed a trade-off: the additional variance proves detrimental to optimization. However, in the case of CG, our unbiased learning procedure meaningfully outperforms its biased counterpart with minimal additional computation.
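
The bias-for-variance exchange mentioned in the abstract can be illustrated on a toy series. The sketch below is a generic randomized-truncation (Russian-roulette style) estimator, not the paper's CG or RFF estimator: it draws a random truncation point N and divides term j by P(N >= j), so the expectation equals the full sum. The names `truncated_estimate` and `rr_estimate`, the toy terms, and the truncation distribution are all assumptions.

```python
import random


def truncated_estimate(terms, pmf, n):
    """Sum the first n terms, reweighting term j by 1 / P(N >= j).

    pmf[k] is P(N = k + 1); every term needs a positive chance of inclusion.
    """
    total, survival = 0.0, 1.0        # survival = P(N >= j), starting at j = 1
    for j in range(n):
        total += terms[j] / survival
        survival -= pmf[j]            # P(N >= j + 1) = P(N >= j) - P(N = j)
    return total


def rr_estimate(terms, pmf, rng):
    """Draw one random truncation point and return the reweighted partial sum."""
    n = rng.choices(range(1, len(terms) + 1), weights=pmf)[0]
    return truncated_estimate(terms, pmf, n)


terms = [1.0, 0.5, 0.25, 0.125]       # toy series with sum 1.875
pmf = [0.4, 0.3, 0.2, 0.1]            # P(N = 1), ..., P(N = 4)

# Exact expectation of the estimator, by enumerating every truncation point.
expectation = sum(p * truncated_estimate(terms, pmf, n)
                  for n, p in enumerate(pmf, start=1))

sample = rr_estimate(terms, pmf, random.Random(0))
```

The expectation matches the full sum even though most draws evaluate only a few terms. Individual samples, however, range from 1.0 to roughly 3.9, which is the added variance the abstract refers to.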
