Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nikhil Ghosh

PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models

Jun 25, 2025

Soufiane Hayou, Nikhil Ghosh, Bin Yu

Abstract:Low-Rank Adaptation (LoRA) is a widely used finetuning method for large models. Its small memory footprint allows practitioners to adapt large models to specific tasks at a fraction of the cost of full finetuning. Different modifications have been proposed to enhance its efficiency by, for example, setting the learning rate, the rank, and the initialization. Another improvement axis is adapter placement strategy: when using LoRA, practitioners usually pick module types to adapt with LoRA, such as Query and Key modules. Few works have studied the problem of adapter placement, with nonconclusive results: original LoRA paper suggested placing adapters in attention modules, while other works suggested placing them in the MLP modules. Through an intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a lightweight method that allows automatic identification of module types where LoRA adapters should be placed, given a pretrained model and a finetuning task. We demonstrate that PLoP consistently outperforms, and in the worst case competes, with commonly used placement strategies through comprehensive experiments on supervised finetuning and reinforcement learning for reasoning.

* TD,LR: A lightweight module type selection method for LoRA finetuning. PLoP gives precise placements for LoRA adapters for improved performance

Via

Access Paper or Ask Questions

The Impact of Initialization on LoRA Finetuning Dynamics

Jun 12, 2024

Soufiane Hayou, Nikhil Ghosh, Bin Yu

Figure 1 for The Impact of Initialization on LoRA Finetuning Dynamics

Figure 2 for The Impact of Initialization on LoRA Finetuning Dynamics

Figure 3 for The Impact of Initialization on LoRA Finetuning Dynamics

Figure 4 for The Impact of Initialization on LoRA Finetuning Dynamics

Abstract:In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in-principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance compared to the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs.

* TDLR: Different Initializations lead to completely different finetuning dynamics. One initialization (set A random and B zero) is generally better than the natural opposite initialization. arXiv admin note: text overlap with arXiv:2402.12354

Via

Access Paper or Ask Questions

LoRA+: Efficient Low Rank Adaptation of Large Models

Feb 19, 2024

Soufiane Hayou, Nikhil Ghosh, Bin Yu

Abstract:In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (1-2 $\%$ improvements) and finetuning speed (up to $\sim$ 2X SpeedUp), at the same computational cost as LoRA.

* 27 pages

Via

Access Paper or Ask Questions

More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

Nov 27, 2023

James B. Simon, Dhruva Karkada, Nikhil Ghosh, Mikhail Belkin

Figure 1 for More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

Figure 2 for More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

Figure 3 for More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

Figure 4 for More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

Abstract:In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained. Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural tangent kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization, overfitting, and more data in random feature models.

Via

Access Paper or Ask Questions

The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

Aug 06, 2023

Nikhil Ghosh, Spencer Frei, Wooseok Ha, Bin Yu

Figure 1 for The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

Figure 2 for The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

Abstract:In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size. In the full-batch setting, we show that the solution is dense (i.e., not sparse) and is highly aligned with its initialized direction, showing that relatively little feature learning occurs. On the other hand, for any batch size strictly smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization, showing that the randomness of stochastic gradients induces a qualitatively different type of "feature selection" in this setting. Moreover, if we measure the sharpness of the minimum by the trace of the Hessian, the minima found with full batch gradient descent are flatter than those found with strictly smaller batch sizes, in contrast to previous works which suggest that large batches lead to sharper minima. To prove convergence of SGD with a constant step size, we introduce a powerful tool from the theory of non-homogeneous random walks which may be of independent interest.

Via

Access Paper or Ask Questions

The Power of External Memory in Increasing Predictive Model Capacity

Jan 31, 2023

Cenk Baykal, Dylan J Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

Figure 1 for The Power of External Memory in Increasing Predictive Model Capacity

Figure 2 for The Power of External Memory in Increasing Predictive Model Capacity

Figure 3 for The Power of External Memory in Increasing Predictive Model Capacity

Figure 4 for The Power of External Memory in Increasing Predictive Model Capacity

Abstract:One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for accessing the table and how are the contents of the table consumed? Prominent methods for accessing the table include 1) using words/wordpieces token-ids as table indices, 2) LSH hashing the token vector in each layer into a table of buckets, and 3) learnable softmax style routing to a table entry. The ways to consume the contents include adding/concatenating to input representation, and using the contents as expert networks that specialize to different inputs. In this work, we conduct rigorous experimental evaluations of existing ideas and their combinations. We also introduce a new method, alternating updates, that enables access to an increased token dimension without increasing the computation time, and demonstrate its effectiveness in language modeling.

* arXiv admin note: substantial text overlap with arXiv:2301.13310

Via

Access Paper or Ask Questions

Alternating Updates for Efficient Transformers

Jan 30, 2023

Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

Figure 1 for Alternating Updates for Efficient Transformers

Figure 2 for Alternating Updates for Efficient Transformers

Figure 3 for Alternating Updates for Efficient Transformers

Figure 4 for Alternating Updates for Efficient Transformers

Abstract:It is well established that increasing scale in deep transformer networks leads to improved quality and performance. This increase in scale often comes with an increase in compute cost and inference latency. Consequently, research into methods which help realize the benefits of increased scale without leading to an increase in the compute cost becomes important. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation without increasing the computation time by working on a subblock of the representation at each layer. Our experiments on various transformer models and language tasks demonstrate the consistent effectiveness of alternating updates on a diverse set of benchmarks. Finally, we present extensions of AltUp to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity.

Via

Access Paper or Ask Questions

A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors

Jul 23, 2022

Nikhil Ghosh, Mikhail Belkin

Figure 1 for A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors

Abstract:In this work we establish an algorithm and distribution independent non-asymptotic trade-off between the model size, excess test loss, and training loss of linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either "classical" -- have training loss close to the noise level, or are "modern" -- have a much larger number of parameters compared to the minimum needed to fit the training data exactly. We also provide a more precise asymptotic analysis when the limiting spectral distribution of the whitened features is Marchenko-Pastur. Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, in settings of most practical interest it differs from the distribution independent bound by only a modest multiplicative constant.

Via

Access Paper or Ask Questions

Deconstructing Distributions: A Pointwise Framework of Learning

Feb 20, 2022

Gal Kaplun, Nikhil Ghosh, Saurabh Garg, Boaz Barak, Preetum Nakkiran

Figure 1 for Deconstructing Distributions: A Pointwise Framework of Learning

Figure 2 for Deconstructing Distributions: A Pointwise Framework of Learning

Figure 3 for Deconstructing Distributions: A Pointwise Framework of Learning

Figure 4 for Deconstructing Distributions: A Pointwise Framework of Learning

Abstract:In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021)

* GK and NG contributed equally

Via

Access Paper or Ask Questions

The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

Nov 13, 2021

Nikhil Ghosh, Song Mei, Bin Yu

Figure 1 for The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

Figure 2 for The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

Figure 3 for The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

Figure 4 for The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

Abstract:To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there exists a limited theoretical understanding of when and why such phenomena occur. In this paper, we consider the training dynamics of gradient flow on kernel least-squares objectives, which is a limiting dynamics of SGD trained neural networks. Using precise high-dimensional asymptotics, we characterize the dynamics of the fitted model in two "worlds": in the Oracle World the model is trained on the population distribution and in the Empirical World the model is trained on a sampled dataset. We show that under mild conditions on the kernel and $L^2$ target regression function the training dynamics undergo three stages characterized by the behaviors of the models in the two worlds. Our theoretical results also mathematically formalize some interesting deep learning phenomena. Specifically, in our setting we show that SGD progressively learns more complex functions and that there is a "deep bootstrap" phenomenon: during the second stage, the test error of both worlds remain close despite the empirical training error being much smaller. Finally, we give a concrete example comparing the dynamics of two different kernels which shows that faster training is not necessary for better generalization.

Via

Access Paper or Ask Questions