Title: Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Oct 03, 2024

Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Zixi Chen, Micah Goldblum, Bayan Bruss, Christopher De Sa, Andrew Gordon Wilson

Abstract: Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small $\omega$ (which measures parameter sharing) and large $\psi$ (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.

* NeurIPS 2024. Code available at https://github.com/AndPotap/einsum-search
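
To make the phrase "linear operators expressible via an Einstein summation" concrete, here is a minimal NumPy sketch of two of the structures the abstract names, low-rank and Kronecker, each applied to a batch of inputs as a single einsum. It is an illustration only and is not taken from the paper or its repository: the function names, index letters, and shapes are assumptions chosen for readability, and the paper's own parameterization (including the $\omega$ and $\psi$ variables, BTT, and BTT-MoE) is not reproduced here.

```python
import numpy as np


def low_rank_apply(x, U, V):
    """Apply the low-rank layer W = U @ V without forming W.

    x: (batch, d_in), U: (d_in, r), V: (r, d_out) -> (batch, d_out).
    Names and shapes are illustrative assumptions, not the paper's API.
    """
    return np.einsum("bi,ir,ro->bo", x, U, V)


def kronecker_apply(x, A, B):
    """Apply the Kronecker layer W = kron(A, B) without forming W.

    x: (batch, m1 * m2), A: (m1, n1), B: (m2, n2) -> (batch, n1 * n2).
    """
    batch = x.shape[0]
    m1, n1 = A.shape
    m2, n2 = B.shape
    # Split the input index into the two axes that A and B act on.
    X = x.reshape(batch, m1, m2)
    Y = np.einsum("bij,ik,jl->bkl", X, A, B)
    # Merge the two output axes back into a single feature axis.
    return Y.reshape(batch, n1 * n2)


def param_count(*factors):
    """Total number of entries across the given factor arrays."""
    return sum(f.size for f in factors)
```

Both functions return the same result as multiplying by the corresponding dense matrix (`x @ (U @ V)` and `x @ np.kron(A, B)`), while storing only the factors. For example, two 4×4 Kronecker factors hold 32 parameters in place of the 256 in the equivalent dense 16×16 matrix.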
