Puneesh Deora

How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data

Oct 27, 2025

Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, Christos Thrampoulidis

Abstract:The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD) -- each update step is $UV^T$ where $U\Sigma V^T$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all principal components of the data at equal rates. We demonstrate how this translates to a growing gap in balanced accuracy favoring SpecGD early in training and further show that the gap remains consistent even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, like Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our findings that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data's underlying components.

* 32 pages, 28 figures
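The SpecGD update described in the abstract above (step along $UV^T$, where $U\Sigma V^T$ is the truncated SVD of the gradient) can be illustrated with a minimal NumPy sketch. The function names `specgd_direction` and `specgd_step`, the rank tolerance, and the learning rate are assumptions made for illustration; this is not the authors' implementation, nor the practical Muon or Shampoo algorithms.

```python
import numpy as np

def specgd_direction(grad, rank=None):
    """Return U V^T from the truncated SVD grad = U diag(s) V^T.

    Hypothetical helper: the rank tolerance below is an assumption,
    chosen to drop singular values that are numerically zero.
    """
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is None:
        tol = s.max() * max(grad.shape) * np.finfo(s.dtype).eps if s.size else 0.0
        rank = int(np.sum(s > tol))
    # Dropping diag(s) sets every retained singular value of the step to 1.
    return U[:, :rank] @ Vt[:rank, :]

def specgd_step(W, grad, lr):
    """One spectral step: W <- W - lr * U V^T (illustrative only)."""
    return W - lr * specgd_direction(grad)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))   # stand-in for a gradient matrix
D = specgd_direction(G)
W_next = specgd_step(np.zeros((4, 3)), G, lr=0.1)
# The step direction has all singular values close to 1, unlike G itself.
step_singular_values = np.linalg.svd(D, compute_uv=False)
```

Because every retained singular direction of the gradient receives a unit-size step, large and small spectral components move at the same rate, which matches the "equal rates" behavior the abstract attributes to SpecGD.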

In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly

Jun 24, 2025

Puneesh Deora, Bhavya Vasudeva, Tina Behnia, Christos Thrampoulidis

Abstract:In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. While existing research has typically studied ICL in fixed-complexity environments, practical language models encounter tasks spanning diverse complexity levels. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones. We design well-controlled testbeds based on Markov chains and linear regression that reveal transformers not only identify the appropriate complexity level for each task but also accurately infer the corresponding parameters--even when the in-context examples are compatible with multiple complexity hypotheses. Notably, when presented with data generated by simpler processes, transformers consistently favor the least complex sufficient explanation. We theoretically explain this behavior through a Bayesian framework, demonstrating that transformers effectively implement an in-context Bayesian Occam's razor by balancing model fit against complexity penalties. We further ablate on the roles of model size, training mixture distribution, inference context length, and architecture. Finally, we validate this Occam's razor-like inductive bias on a pretrained GPT-4 model with Boolean-function tasks as case study, suggesting it may be inherent to transformers trained on diverse task distributions.

* 28 pages, 19 figures

Implicit Bias and Fast Convergence Rates for Self-attention

Feb 08, 2024

Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis

Abstract:Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance. Towards developing the fundamental optimization principles of self-attention, we investigate the implicit bias of gradient descent (GD) in training a self-attention layer with fixed linear decoder in binary classification. Drawing inspiration from the study of GD in linear logistic regression over separable data, recent work demonstrates that as the number of iterations $t$ approaches infinity, the key-query matrix $W_t$ converges locally (with respect to the initialization direction) to a hard-margin SVM solution $W_{mm}$. Our work enhances this result in four aspects. Firstly, we identify non-trivial data settings for which convergence is provably global, thus shedding light on the optimization landscape. Secondly, we provide the first finite-time convergence rate for $W_t$ to $W_{mm}$, along with quantifying the rate of sparsification in the attention map. Thirdly, through an analysis of normalized GD and Polyak step-size, we demonstrate analytically that adaptive step-size rules can accelerate the convergence of self-attention. Additionally, we remove the restriction of prior work on a fixed linear decoder. Our results reinforce the implicit-bias perspective of self-attention and strengthen its connections to implicit-bias in linear logistic regression, despite the intricate non-convex nature of the former.

* 41 pages, 7 figures
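The adaptive step-size rules named in the abstract above, normalized GD and the Polyak step-size, can be sketched on the simpler problem the abstract draws inspiration from: linear logistic regression over separable data. Everything below is an illustrative assumption (the toy dataset, the `run` helper, the step count, and the use of 0 as the infimum of the loss in the Polyak rule); it does not reproduce the paper's self-attention model or its rates.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean logistic loss and its gradient for labels y in {-1, +1}."""
    m = y * (X @ w)                      # margins
    loss = np.mean(np.logaddexp(0.0, -m))
    s = np.exp(-np.logaddexp(0.0, m))    # sigmoid(-m), computed stably
    grad = -(X * (s * y)[:, None]).mean(axis=0)
    return loss, grad

def run(rule, X, y, steps=50, lr=1.0):
    """Run `steps` iterations from w = 0 and return the final loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        loss, g = loss_and_grad(w, X, y)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break
        if rule == "gd":
            eta = lr                     # constant step size
        elif rule == "normalized":
            eta = lr / g_norm            # normalized GD
        elif rule == "polyak":
            eta = loss / g_norm ** 2     # Polyak step, assuming inf loss = 0
        else:
            raise ValueError(rule)
        w = w - eta * g
    return loss_and_grad(w, X, y)[0]

# Toy linearly separable data (an assumption for illustration).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
final_loss = {rule: run(rule, X, y) for rule in ("gd", "normalized", "polyak")}
```

On this toy problem both adaptive rules end with a far smaller loss than constant-step GD after the same number of iterations, because their effective step grows as the gradient shrinks.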

On the Optimization and Generalization of Multi-head Attention

Oct 19, 2023

Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis

Abstract:The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.

* 48 pages; presented at the Workshop on High-dimensional Learning Dynamics, ICML 2023

LoOp: Looking for Optimal Hard Negative Embeddings for Deep Metric Learning

Aug 20, 2021

Bhavya Vasudeva, Puneesh Deora, Saumik Bhattacharya, Umapada Pal, Sukalpa Chanda

Abstract:Deep metric learning has been effectively used to learn distance metrics for different visual tasks like image retrieval, clustering, etc. In order to aid the training process, existing methods either use a hard mining strategy to extract the most informative samples or seek to generate hard synthetics using an additional network. Such approaches face different challenges and can lead to biased embeddings in the former case, and (i) harder optimization (ii) slower training speed (iii) higher model complexity in the latter case. In order to overcome these challenges, we propose a novel approach that looks for optimal hard negatives (LoOp) in the embedding space, taking full advantage of each tuple by calculating the minimum distance between a pair of positives and a pair of negatives. Unlike mining-based methods, our approach considers the entire space between pairs of embeddings to calculate the optimal hard negatives. Extensive experiments combining our approach and representative metric learning losses reveal a significant boost in performance on three benchmark datasets.

* 17 pages, 9 figures, 5 tables. Accepted at The IEEE/CVF International Conference on Computer Vision (ICCV) 2021

AIM 2020 Challenge on Learned Image Signal Processing Pipeline

Nov 10, 2020

Andrey Ignatov, Radu Timofte, Zhilu Zhang, Ming Liu, Haolin Wang, Wangmeng Zuo, Jiawei Zhang, Ruimao Zhang, Zhanglin Peng, Sijie Ren (+29 more)

Abstract:This paper reviews the second AIM learned ISP challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world RAW-to-RGB mapping problem, where the goal was to map the original low-quality RAW images captured by the Huawei P20 device to the same photos obtained with the Canon 5D DSLR camera. The considered task embraced a number of complex computer vision subtasks, such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical image signal processing pipeline modeling.

* Published in ECCV 2020 Workshops (Advances in Image Manipulation), https://data.vision.ee.ethz.ch/cvl/aim20/

Co-VeGAN: Complex-Valued Generative Adversarial Network for Compressive Sensing MR Image Reconstruction

Feb 24, 2020

Bhavya Vasudeva, Puneesh Deora, Saumik Bhattacharya, Pyari Mohan Pradhan

Abstract:Compressive sensing (CS) is widely used to reduce the image acquisition time of magnetic resonance imaging (MRI). Though CS based undersampling has numerous benefits, like high quality images with less motion artefacts, low storage requirement, etc., the reconstruction of the image from the CS-undersampled data is an ill-posed inverse problem which requires extensive computation and resources. In this paper, we propose a novel deep network that can process complex-valued input to perform high-quality reconstruction. Our model is based on generative adversarial network (GAN) that uses residual-in-residual dense blocks in a modified U-net generator with patch based discriminator. We introduce a wavelet based loss in the complex GAN model for better reconstruction quality. Extensive analyses on different datasets demonstrate that the proposed model significantly outperforms the existing CS reconstruction techniques in terms of peak signal-to-noise ratio and structural similarity index.
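For readers unfamiliar with the peak signal-to-noise ratio cited in the abstract above, a minimal sketch of the standard definition, $\mathrm{PSNR} = 10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, follows. The function name `psnr` and the `data_range` parameter are assumptions for illustration, as is the choice to compare real-valued (for example, magnitude) images; the paper's exact evaluation protocol is not stated here.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two real-valued images.

    `data_range` is the maximum possible pixel value (assumed 1.0 for
    images normalized to [0, 1]).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# A constant error of 0.1 on a [0, 1] image gives MSE = 0.01.
ref = np.zeros((8, 8))
est = np.full((8, 8), 0.1)
value = psnr(ref, est)
```

Higher values indicate a reconstruction closer to the reference; the structural similarity index mentioned alongside it measures a different, perceptually motivated notion of fidelity.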

Robust Compressive Sensing MRI Reconstruction using Generative Adversarial Networks

Oct 14, 2019

Puneesh Deora, Bhavya Vasudeva, Saumik Bhattacharya, Pyari Mohan Pradhan

Abstract:Compressive sensing magnetic resonance imaging (CS-MRI) accelerates the acquisition of MR images by breaking the Nyquist sampling limit. In this work, a novel generative adversarial network (GAN) based framework for CS-MRI reconstruction is proposed. Leveraging a combination of patchGAN discriminator and structural similarity index based loss, our model focuses on preserving high frequency content as well as fine textural details in the reconstructed image. Dense and residual connections have been incorporated in a U-net based generator architecture to allow easier transfer of information as well as variable network length. We show that our algorithm outperforms state-of-the-art methods in terms of quality of reconstruction and robustness to noise. Also, the reconstruction time, which is of the order of milliseconds, makes it highly suitable for real-time clinical use.
