Sunny Sanyal

Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting

Feb 05, 2025

Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, Sujay Sanghavi

Abstract: Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a sample weighting scheme for the fine-tuning data solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a 0.8% drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving 5.4% more accuracy on the pre-training datasets. Our code is publicly available at https://github.com/sanyalsunny111/FLOW_finetuning.

* 49 pages, 4 figures, 12 tables. Code available at https://github.com/sanyalsunny111/FLOW_finetuning
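
The weighting idea in the abstract above can be illustrated with a small sketch. The abstract only says that samples with low pre-trained loss are upweighted; the exponential form, the `temperature` parameter, the mean-one normalization, and the function names below are assumptions made for illustration, not the authors' released implementation.

```python
import math


def easy_sample_weights(pretrained_losses, temperature=1.0):
    """Turn per-sample losses of the frozen pre-trained model into weights.

    Assumption: weights decay exponentially with the loss, so "easy"
    samples (low loss) get larger weights than "hard" ones.
    """
    if not pretrained_losses:
        raise ValueError("need at least one loss value")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    raw = [math.exp(-loss / temperature) for loss in pretrained_losses]
    total = sum(raw)
    n = len(raw)
    # Normalize so the weights average to 1 (an illustrative choice).
    return [n * r / total for r in raw]


def weighted_loss(current_losses, weights):
    """Weighted mean of the fine-tuning losses; weights stay fixed during training."""
    if len(current_losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, current_losses)) / len(weights)


# Hypothetical losses of the pre-trained model on three fine-tuning samples.
weights = easy_sample_weights([0.1, 1.0, 3.0])
```

The weights are computed once from the pre-trained model and then reused at every fine-tuning step, which is why the sketch keeps them separate from the per-step loss.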

DataComp-LM: In search of the next generation of training sets for language models

Jun 18, 2024

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora(+49 more)

Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.

* Project page: https://www.datacomp.ai/dclm/

Pre-training Small Base LMs with Fewer Tokens

Apr 12, 2024

Sunny Sanyal, Sujay Sanghavi, Alexandros G. Dimakis

Abstract: We study the effectiveness of a simple approach to develop a small base language model (LM) starting from an existing large base LM: first inherit a few transformer blocks from the larger LM, and then train this smaller model on a very small subset (0.1%) of the raw pretraining data of the larger model. We call our simple recipe Inheritune and first demonstrate it for building a small base LM with 1.5B parameters using 1B tokens (and the first few layers of a larger LM with 3B parameters); we do this using a single A6000 GPU for less than half a day. Across 9 diverse evaluation datasets as well as the MMLU benchmark, the resulting model compares favorably to publicly available base models of 1B-2B size, some of which have been trained using 50-1000 times more tokens. We also investigate Inheritune in a slightly different setting where we train small LMs utilizing larger LMs and their full pre-training dataset. Here we show that smaller LMs trained utilizing some of the layers of GPT-2-medium (355M) and GPT-2-large (770M) can effectively match the validation loss of their bigger counterparts when trained from scratch for the same number of training steps on the OpenWebText dataset with 9B tokens. We analyze our recipe with extensive experiments and demonstrate its efficacy in diverse settings. Our code is available at https://github.com/sanyalsunny111/LLM-Inheritune.

* 15 pages, 6 figures, 10 tables
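
The two steps of the recipe described in the abstract above (inherit the first few blocks, then train on a small data subset) can be sketched as follows. This is a toy illustration only: the model is represented as a plain list of per-block parameter dicts, and `inherit_first_blocks`, `select_training_subset`, and the `seed` argument are hypothetical names, not the API of the LLM-Inheritune repository.

```python
import copy
import random


def inherit_first_blocks(large_model_blocks, num_blocks):
    """Initialize a smaller model from the first `num_blocks` blocks of a larger one.

    Blocks are deep-copied so that training the small model leaves the
    large model's parameters untouched.
    """
    if not 0 < num_blocks <= len(large_model_blocks):
        raise ValueError("num_blocks must be between 1 and the number of blocks")
    return [copy.deepcopy(block) for block in large_model_blocks[:num_blocks]]


def select_training_subset(examples, fraction=0.001, seed=0):
    """Randomly pick a small fraction of the pretraining examples (0.1% by default).

    Assumption: uniform random sampling; the abstract does not say how
    the subset is chosen.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(examples) * fraction))
    return random.Random(seed).sample(examples, k)


# Toy "large model" with 6 blocks, each holding a single parameter list.
large = [{"w": [float(i)]} for i in range(6)]
small = inherit_first_blocks(large, 2)
subset = select_training_subset(list(range(1000)))
```

The small model would then be trained on `subset` with an ordinary language-modeling loop, which is omitted here.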

Understanding the Effectiveness of Early Weight Averaging for Training Large Language Models

Jun 05, 2023

Sunny Sanyal, Jean Kaddour, Abhishek Kumar, Sujay Sanghavi

Abstract: Training LLMs is expensive, and recent evidence indicates training all the way to convergence is inefficient. In this paper, we investigate the ability of a simple idea, checkpoint averaging along the trajectory of a training run, to improve the quality of models before they have converged. This approach incurs no extra cost during training or inference. Specifically, we analyze the training trajectories of Pythia LLMs with 1 to 12 billion parameters and demonstrate that, particularly during the early to mid stages of training, this idea accelerates convergence and improves both test and zero-shot generalization. Loss spikes are a well-recognized problem in LLM training; in our analysis we encountered two instances of this in the underlying trajectories, and both instances were mitigated by our averaging. For a 6.9B parameter LLM, for example, our early weight averaging recipe can save up to 4200 hours of GPU time, which corresponds to significant savings in cloud compute costs.

* 17 pages, 12 figures, under review
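
Checkpoint averaging, as mentioned in the abstract above, can be sketched in a few lines. The sketch assumes a uniform average over checkpoints stored as dicts that map parameter names to flat lists of floats; the abstract does not state which checkpoints are averaged or how they are weighted, and `average_checkpoints` is a hypothetical helper name.

```python
def average_checkpoints(checkpoints):
    """Uniformly average a list of checkpoints, parameter by parameter.

    Each checkpoint is a dict {parameter_name: list of floats}; all
    checkpoints must share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    averaged = {}
    for name in names:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the i-th entry of every checkpoint together.
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


# Two toy checkpoints saved along one training run.
avg = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```

Because the averaging only reads saved checkpoints, it can run offline after training, consistent with the abstract's point that it adds no training or inference cost.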

Data Aggregation Techniques for Internet of Things

Jul 24, 2019

Sunny Sanyal

Abstract: The goal of this dissertation is to design efficient data aggregation frameworks for massive IoT networks in different scenarios to support the proper functioning of the IoT analytics layer. This dissertation includes modern algorithmic frameworks such as non-convex optimization, machine learning, stochastic matrix perturbation theory and federated filtering, along with modern computing infrastructure such as fog computing and cloud computing. The development of such an ambitious design involves many open challenges; this proposal envisions three major ones for IoT data aggregation: first, severe resource constraints of IoT nodes due to limited power and computational ability; second, highly uncertain (unreliable) raw IoT data that is not fit for decision-making; and third, network latency and privacy issues for critical applications. This dissertation presents three independent novel approaches for distinct scenarios to solve one or more of the aforementioned open challenges. The first approach focuses on energy-efficient routing and discusses a clustering protocol based on device-to-device communication for both stationary and mobile IoT nodes. The second approach focuses on processing uncertain raw IoT data and presents an IoT data aggregation scheme to improve the quality of raw IoT data. Finally, the third approach focuses on power loss due to communication overhead and privacy issues for medical IoT (IoMT) devices and describes a prediction-based data aggregation framework for massive IoMT devices.

* This is the master's thesis of Mr. Sunny Sanyal, who graduated from Chongqing University of Posts and Telecommunications, Chongqing, China. This thesis received the University's Excellent Master's Thesis Award 2019 (across all departments). All the chapters in this thesis are published in various venues.
