Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chandramouli Shama Sastry

Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition

Feb 20, 2025

Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Z. Yang, Zachary Nado(+4 more)

Abstract:The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

* ICLR 2025; 23 pages, 5 figures, 8 tables

Via

Access Paper or Ask Questions

Test-Time Training for Depression Detection

Apr 07, 2024

Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore

Figure 1 for Test-Time Training for Depression Detection

Figure 2 for Test-Time Training for Depression Detection

Figure 3 for Test-Time Training for Depression Detection

Figure 4 for Test-Time Training for Depression Detection

Abstract:Previous works on depression detection use datasets collected in similar environments to train and test the models. In practice, however, the train and test distributions cannot be guaranteed to be identical. Distribution shifts can be introduced due to variations such as recording environment (e.g., background noise) and demographics (e.g., gender, age, etc). Such distributional shifts can surprisingly lead to severe performance degradation of the depression detection models. In this paper, we analyze the application of test-time training (TTT) to improve robustness of models trained for depression detection. When compared to regular testing of the models, we find TTT can significantly improve the robustness of the model under a variety of distributional shifts introduced due to: (a) background-noise, (b) gender-bias, and (c) data collection and curation procedure (i.e., train and test samples are from separate datasets).

Via

Access Paper or Ask Questions

Benchmarking Neural Network Training Algorithms

Jun 12, 2023

George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo(+15 more)

Figure 1 for Benchmarking Neural Network Training Algorithms

Figure 2 for Benchmarking Neural Network Training Algorithms

Figure 3 for Benchmarking Neural Network Training Algorithms

Figure 4 for Benchmarking Neural Network Training Algorithms

Abstract:Training algorithms, broadly construed, are an essential part of every deep learning pipeline. Training algorithm improvements that speed up training across a wide variety of workloads (e.g., better update rules, tuning protocols, learning rate schedules, or data selection schemes) could save time, save computational resources, and lead to better, more accurate, models. Unfortunately, as a community, we are currently unable to reliably identify training algorithm improvements, or even determine the state-of-the-art training algorithm. In this work, using concrete experiments, we argue that real progress in speeding up training requires new benchmarks that resolve three basic challenges faced by empirical comparisons of training algorithms: (1) how to decide when training is complete and precisely measure training time, (2) how to handle the sensitivity of measurements to exact workload details, and (3) how to fairly compare algorithms that require hyperparameter tuning. In order to address these challenges, we introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware, the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of workload variants that make it possible to detect benchmark submissions that are more robust to workload changes than current widely-used methods. Finally, we evaluate baseline submissions constructed using various optimizers that represent current practice, as well as other optimizers that have recently received attention in the literature. These baseline results collectively demonstrate the feasibility of our benchmark, show that non-trivial gaps between methods exist, and set a provisional state-of-the-art for future benchmark submissions to try and surpass.

* 102 pages, 8 figures, 41 tables

Via

Access Paper or Ask Questions

Efficient CDF Approximations for Normalizing Flows

Feb 23, 2022

Chandramouli Shama Sastry, Andreas Lehrmann, Marcus Brubaker, Alexander Radovic

Figure 1 for Efficient CDF Approximations for Normalizing Flows

Figure 2 for Efficient CDF Approximations for Normalizing Flows

Figure 3 for Efficient CDF Approximations for Normalizing Flows

Figure 4 for Efficient CDF Approximations for Normalizing Flows

Abstract:Normalizing flows model a complex target distribution in terms of a bijective transform operating on a simple base distribution. As such, they enable tractable computation of a number of important statistical quantities, particularly likelihoods and samples. Despite these appealing properties, the computation of more complex inference tasks, such as the cumulative distribution function (CDF) over a complex region (e.g., a polytope) remains challenging. Traditional CDF approximations using Monte-Carlo techniques are unbiased but have unbounded variance and low sample efficiency. Instead, we build upon the diffeomorphic properties of normalizing flows and leverage the divergence theorem to estimate the CDF over a closed region in target space in terms of the flux across its \emph{boundary}, as induced by the normalizing flow. We describe both deterministic and stochastic instances of this estimator: while the deterministic variant iteratively improves the estimate by strategically subdividing the boundary, the stochastic variant provides unbiased estimates. Our experiments on popular flow architectures and UCI benchmark datasets show a marked improvement in sample efficiency as compared to traditional estimators.

Via

Access Paper or Ask Questions

Musical Speech: A Transformer-based Composition Tool

Aug 02, 2021

Jason d'Eon, Sri Harsha Dumpala, Chandramouli Shama Sastry, Dani Oore, Sageev Oore

Figure 1 for Musical Speech: A Transformer-based Composition Tool

Figure 2 for Musical Speech: A Transformer-based Composition Tool

Figure 3 for Musical Speech: A Transformer-based Composition Tool

Figure 4 for Musical Speech: A Transformer-based Composition Tool

Abstract:In this paper, we propose a new compositional tool that will generate a musical outline of speech recorded/provided by the user for use as a musical building block in their compositions. The tool allows any user to use their own speech to generate musical material, while still being able to hear the direct connection between their recorded speech and the resulting music. The tool is built on our proposed pipeline. This pipeline begins with speech-based signal processing, after which some simple musical heuristics are applied, and finally these pre-processed signals are passed through Transformer models trained on new musical tasks. We illustrate the effectiveness of our pipeline -- which does not require a paired dataset for training -- through examples of music created by musicians making use of our tool.

* NeurIPS 2020 Demonstration Track; extended for PMLR

Via

Access Paper or Ask Questions

Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices

Jan 09, 2020

Chandramouli Shama Sastry, Sageev Oore

Figure 1 for Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices

Figure 2 for Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices

Figure 3 for Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices

Figure 4 for Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices

Abstract:When presented with Out-of-Distribution (OOD) examples, deep neural networks yield confident, incorrect predictions. Detecting OOD examples is challenging, and the potential risks are high. In this paper, we propose to detect OOD examples by identifying inconsistencies between activity patterns and class predicted. We find that characterizing activity patterns by Gram matrices and identifying anomalies in gram matrix values can yield high OOD detection rates. We identify anomalies in the gram matrices by simply comparing each value with its respective range observed over the training data. Unlike many approaches, this can be used with any pre-trained softmax classifier and does not require access to OOD data for fine-tuning hyperparameters, nor does it require OOD access for inferring parameters. The method is applicable across a variety of architectures and vision datasets and, for the important and surprisingly hard task of detecting far-from-distribution out-of-distribution examples, it generally performs better than or equal to state-of-the-art OOD detection methods (including those that do assume access to OOD examples).

* NeurIPS 2019 Workshop on Safety and Robustness in Decision Making

Via

Access Paper or Ask Questions