Vimal Thilak

Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

Jan 21, 2025

Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak

Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity is not yet fully understood. We explore this relationship in the context of sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects both pretraining and downstream performance. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
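
As a rough illustration of the sparsity definition above (the ratio of non-active to total parameters), the sketch below counts parameters for a hypothetical MoE configuration. The function name, the configuration fields, and the simplification that only expert parameters are sparsely activated are assumptions made for illustration, not details taken from the paper.

```python
def moe_sparsity(num_experts: int, active_experts: int,
                 params_per_expert: int, shared_params: int = 0) -> float:
    """Return sparsity = non-active parameters / total parameters.

    Assumption: every expert has the same size, and `shared_params`
    (e.g. attention or embedding weights) are always active.
    """
    if not 0 < active_experts <= num_experts:
        raise ValueError("active_experts must be in [1, num_experts]")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return (total - active) / total


# Hypothetical configuration: 8 experts, 2 used per token, no shared parameters.
print(moe_sparsity(num_experts=8, active_experts=2, params_per_expert=1_000_000))
```

Under this accounting, a dense model (all experts active) has sparsity 0, and adding experts while holding the number of active experts fixed raises sparsity without changing the active parameter count.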

Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning

Oct 14, 2024

Etai Littwin, Vimal Thilak, Anand Gopalakrishnan

Abstract: Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE) for representation learning using the Masked Image Modeling framework. IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space. However, IJEPA relies on carefully designed context and target windows to avoid representational collapse. The encoder modules in IJEPA cannot adaptively modulate the type of predicted and/or target features based on the feasibility of the masked prediction task, as they are not given sufficient information about both context and targets. Based on the intuition that in natural images, information has a strong spatial bias, with spatially local regions being highly predictive of one another compared to distant ones, we condition the target encoder and context encoder modules in IJEPA with positions of context and target windows respectively. Our "conditional" encoders show performance gains on several image classification benchmark datasets, as well as improved robustness to context window size and sample efficiency during pretraining.

* NeurIPS 2024 Workshop on Self-Supervised Learning - Theory and Practice. Comments welcome!

Towards Automatic Assessment of Self-Supervised Speech Models using Rank

Sep 16, 2024

Zakaria Aldeneh, Vimal Thilak, Takuya Higuchi, Barry-John Theobald, Tatiana Likhomanenko

Abstract: This study explores using embedding rank as an unsupervised evaluation metric for general-purpose speech encoders trained via self-supervised learning (SSL). Traditionally, assessing the performance of these encoders is resource-intensive and requires labeled data from the downstream tasks. Inspired by the vision domain, where embedding rank has shown promise for evaluating image encoders without tuning on labeled downstream data, this work examines its applicability in the speech domain, considering the temporal nature of the signals. The findings indicate rank correlates with downstream performance within encoder layers across various downstream tasks and for in- and out-of-domain scenarios. However, rank does not reliably predict the best-performing layer for specific downstream tasks, as lower-ranked layers can outperform higher-ranked ones. Despite this limitation, the results suggest that embedding rank can be a valuable tool for monitoring training progress in SSL speech models, offering a less resource-demanding alternative to traditional evaluation methods.

* Submitted to ICASSP 2025
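
To make the idea of embedding rank as a label-free metric more concrete, the sketch below computes an entropy-based effective rank from the singular values of an embedding matrix. The abstract does not state which rank estimator is used, so this particular estimator, the function name, and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def effective_rank(embeddings: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of an (n_samples, dim) embedding matrix.

    Assumption: rank is estimated as exp(entropy) of the normalized
    singular values; other estimators are possible.
    """
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))


rng = np.random.default_rng(0)
# Synthetic stand-ins for frame-level speech embeddings (hypothetical data).
full = rng.normal(size=(256, 16))  # variance spread over all 16 dimensions
collapsed = np.outer(rng.normal(size=256), rng.normal(size=16))  # rank-1 matrix

print(effective_rank(full), effective_rank(collapsed))
```

No labels are involved, so a score like this can be tracked per layer or per checkpoint during pretraining; the collapsed matrix scores close to 1, while the well-spread one scores close to its dimensionality.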

How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks

Jul 03, 2024

Etai Littwin, Omid Saremi, Madhu Advani, Vimal Thilak, Preetum Nakkiran, Chen Huang, Joshua Susskind

Abstract: Two competing paradigms exist for self-supervised learning of data representations. Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other. A recent successful approach that falls under the JEPA framework is self-distillation, where an online encoder is trained to predict the output of the target encoder, sometimes using a lightweight predictor network. This is contrasted with the Masked AutoEncoder (MAE) paradigm, where an encoder and decoder are trained to reconstruct missing parts of the input in the data space rather than its latent representation. A common motivation for using the JEPA approach over MAE is that the JEPA objective prioritizes abstract features over fine-grained pixel information (which can be unpredictable and uninformative). In this work, we seek to understand the mechanism behind this empirical observation by analyzing the training dynamics of deep linear models. We uncover a surprising mechanism: in a simplified linear setting where both approaches learn similar representations, JEPAs are biased to learn high-influence features, i.e., features characterized by having high regression coefficients. Our results point to a distinct implicit bias of predicting in latent space that may shed light on its success in practice.

* Technical report

LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures

Dec 07, 2023

Vimal Thilak, Chen Huang, Omid Saremi, Laurent Dinh, Hanlin Goh, Preetum Nakkiran, Joshua M. Susskind, Etai Littwin

Abstract: Joint embedding (JE) architectures have emerged as a promising avenue for acquiring transferable data representations. A key obstacle to using JE methods, however, is the inherent challenge of evaluating learned representations without access to a downstream task and an annotated dataset. Without efficient and reliable evaluation, it is difficult to iterate on architectural and training choices for JE methods. In this paper, we introduce LiDAR (Linear Discriminant Analysis Rank), a metric designed to measure the quality of representations within JE architectures. Our metric addresses several shortcomings of recent approaches based on feature covariance rank by discriminating between informative and uninformative features. In essence, LiDAR quantifies the rank of the Linear Discriminant Analysis (LDA) matrix associated with the surrogate SSL task -- a measure that intuitively captures the information content as it pertains to solving the SSL task. We empirically demonstrate that LiDAR significantly surpasses naive rank-based approaches in its predictive power of optimal hyperparameters. Our proposed criterion presents a more robust and intuitive means of assessing the quality of representations within JE architectures, which we hope facilitates broader adoption of these powerful techniques in various domains.

* Technical report

Vanishing Gradients in Reinforcement Finetuning of Language Models

Oct 31, 2023

Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, Etai Littwin

Abstract: Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which entails maximizing a (possibly learned) reward function using policy gradient algorithms. This work highlights a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful of inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.
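
The link between reward standard deviation and gradient size can be checked numerically in a toy setting. The sketch below assumes a tabular softmax policy over a handful of candidate outputs for a single input, which is far simpler than a language model; the function names, logits, and reward values are hypothetical. In this toy setting the gradient of the expected reward with respect to logit y is p_y (r_y - E[r]), so its norm cannot exceed the reward standard deviation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_stats_and_grad_norm(logits, rewards):
    """Return (expected reward, reward std, norm of d E[r] / d logits)."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    variance = sum(p * (r - expected) ** 2 for p, r in zip(probs, rewards))
    # For a tabular softmax policy: d E[r] / d logit_y = p_y * (r_y - E[r]).
    grad = [p * (r - expected) for p, r in zip(probs, rewards)]
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return expected, math.sqrt(variance), grad_norm


rewards = [1.0, 0.0, 0.0]            # only the first output is rewarded
uncertain = [0.0, 0.0, 0.0]          # uniform policy: high reward std
confident_wrong = [-10.0, 5.0, 5.0]  # mass on unrewarded outputs: low reward std

for name, logits in [("uncertain", uncertain), ("confident_wrong", confident_wrong)]:
    expected, std, grad_norm = reward_stats_and_grad_norm(logits, rewards)
    print(f"{name}: E[r]={expected:.2e} std={std:.2e} |grad|={grad_norm:.2e}")
```

The confidently wrong policy has an expected reward near zero, far from the optimum of 1, yet its gradient is tiny because almost every sampled output receives the same reward.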

Adaptivity and Modularity for Efficient Generalization Over Task Complexity

Oct 13, 2023

Samira Abnar, Omid Saremi, Laurent Dinh, Shantel Wilson, Miguel Angel Bautista, Chen Huang, Vimal Thilak, Etai Littwin, Jiatao Gu, Josh Susskind(+1 more)

Abstract: Can transformers generalize efficiently on problems that require dealing with examples with different levels of difficulty? We introduce a new task tailored to assess generalization over different complexities and present results that indicate that standard transformers face challenges in solving these tasks. These tasks are variations of pointer value retrieval previously introduced by Zhang et al. (2021). We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph). Based on our observations, we propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hyper networks with adaptive depth from Universal Transformers. This model demonstrates higher accuracy and a fairer allocation of computational resources when generalizing to higher numbers of computation steps. We conclude that mechanisms for adaptive depth and modularity complement each other in improving efficient generalization concerning example complexity. Additionally, to emphasize the broad applicability of our findings, we illustrate that in a standard image recognition task, Hyper-UT's performance matches that of a ViT model but with considerably reduced computational demands (achieving over 70% average savings by effectively using fewer layers).

The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

Jun 13, 2022

Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua Susskind

Abstract: The grokking phenomenon as reported by Power et al. (arXiv:2201.02177) refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layer's weights. We empirically observe that without explicit regularization, Grokking as reported in (arXiv:2201.02177) almost exclusively happens at the onset of Slingshots, and is absent without it. While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any known optimization theories that we are aware of, and can be easily overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of their origin.

* Removed TeX formatting commands in Title and Abstract

Implicit Greedy Rank Learning in Autoencoders via Overparameterized Linear Networks

Jul 02, 2021

Shih-Yu Sun, Vimal Thilak, Etai Littwin, Omid Saremi, Joshua M. Susskind

Abstract: Deep linear networks trained with gradient descent yield low-rank solutions, as is typically studied in matrix factorization. In this paper, we take a step further and analyze implicit rank regularization in autoencoders. We show greedy learning of low-rank latent codes induced by a linear sub-network at the autoencoder bottleneck. We further propose orthogonal initialization and principled learning rate adjustment to mitigate sensitivity of training dynamics to spectral prior and linear depth. With linear autoencoders on synthetic data, our method converges stably to ground-truth latent code rank. With nonlinear autoencoders, our method converges to latent ranks optimal for downstream classification and image sampling.

Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks

Jul 02, 2021

Etai Littwin, Omid Saremi, Shuangfei Zhai, Vimal Thilak, Hanlin Goh, Joshua M. Susskind, Greg Yang

Abstract: We analyze the learning dynamics of infinitely wide neural networks with a finite-sized bottleneck. Unlike the neural tangent kernel limit, a bottleneck in an otherwise infinite width network allows data dependent feature learning in its bottleneck representation. We empirically show that a single bottleneck in infinite networks dramatically accelerates training when compared to purely infinite networks, with an improved overall performance. We discuss the acceleration phenomena by drawing similarities to infinitely wide deep linear models, where the acceleration effect of a bottleneck can be understood theoretically.
