Abstract:Can modifying the training data distribution guide optimizers toward solutions with improved generalization when training large language models (LLMs)? In this work, we theoretically analyze an in-context linear regression model with multi-head linear self-attention and compare the training dynamics of two gradient-based optimizers, gradient descent (GD) and sharpness-aware minimization (SAM); the latter exhibits superior generalization properties but is prohibitively expensive for training even medium-sized LLMs. We show, for the first time, that SAM induces a lower simplicity bias (SB), the tendency of an optimizer to preferentially learn simpler features earlier in training, and identify this reduction as a key factor underlying its improved generalization. Motivated by this insight, we demonstrate that altering the training data distribution by upsampling or augmenting examples learned later in training similarly reduces SB and improves generalization. Our extensive experiments show that this strategy improves the performance of multiple LLMs, including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base, achieving relative accuracy gains of up to 18% when fine-tuned with AdamW and Muon on mathematical reasoning tasks.
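As a point of reference for the comparison above, the following is a minimal PyTorch sketch of the standard two-pass SAM update (perturb the weights along the normalized gradient, then descend on the perturbed loss). It illustrates generic SAM rather than this paper's analysis or its data-upsampling strategy; `model`, `loss_fn`, `batch`, and `rho` are illustrative placeholders.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """One sharpness-aware minimization (SAM) update: ascend to the locally
    worst-case weights w + e, then apply the base optimizer with the gradient
    taken at w + e. All arguments are illustrative placeholders."""
    # First forward/backward pass: gradient g at the current weights w.
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))

    # Ascent step: e = rho * g / ||g||, applied in place.
    eps = []
    with torch.no_grad():
        for p, g in zip(params, grads):
            e = rho * g / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()

    # Second forward/backward pass: gradient at the perturbed weights w + e.
    loss_perturbed = loss_fn(model(batch["x"]), batch["y"])
    loss_perturbed.backward()

    # Restore the original weights and step with the SAM gradient.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss_perturbed.item()
```

A plain GD or AdamW step would use only the first gradient; SAM's second forward/backward pass is what makes it roughly twice as expensive per step, which is the cost the data-centric alternative above aims to avoid.
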
Abstract:Understanding the general principles behind transformer models remains a complex endeavor. Experiments with probing and with disentangling features using sparse autoencoders (SAEs) suggest that these models may manipulate linear features embedded as directions in the residual stream. This paper explores the resemblance between the decoder-only transformer architecture and vector symbolic architectures (VSAs) and presents experiments indicating that GPT-2 uses mechanisms involving nearly orthogonal vector bundling and binding operations, similar to VSAs, for computation and communication between layers. It further shows that these principles help explain a significant portion of the actual neural weights.
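To make the VSA terminology concrete, below is a small NumPy sketch of bundling (superposition by addition) and binding (elementwise Hadamard product) with random, nearly orthogonal bipolar hypervectors, in the style of MAP-type VSAs. It is a generic illustration of these operations, not a reconstruction of the mechanisms the paper attributes to GPT-2; the symbol names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # in high dimensions, random bipolar vectors are nearly orthogonal

def random_symbol():
    """Random bipolar (+1/-1) hypervector, as used in MAP-style VSAs."""
    return rng.choice([-1.0, 1.0], size=d)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Binding = elementwise (Hadamard) product; bundling = superposition (addition).
role_subject, role_object = random_symbol(), random_symbol()
cat, mouse = random_symbol(), random_symbol()

bundle = role_subject * cat + role_object * mouse

# Hadamard binding of bipolar vectors is self-inverse, so multiplying by the
# role vector again approximately recovers the bound filler; cosine similarity
# separates it cleanly from distractors.
recovered = bundle * role_subject
print("sim to cat:  ", cosine(recovered, cat))    # high (about 0.7 here)
print("sim to mouse:", cosine(recovered, mouse))  # near 0
```

The key property is that random high-dimensional vectors are almost orthogonal, so several bound role-filler pairs can coexist in a single bundle and still be queried by similarity.
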




Abstract:Token-free language models learn directly from raw bytes and remove the bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences, and standard autoregressive Transformers scale poorly in such settings. We experiment with MambaByte, a token-free adaptation of the Mamba state space model, trained autoregressively on byte sequences. Our experiments show that MambaByte is more computationally efficient than other byte-level models. We also find that MambaByte is competitive with, and even outperforms, state-of-the-art subword Transformers. Furthermore, owing to linear scaling in sequence length, MambaByte enjoys fast inference compared to Transformers. Our findings establish the viability of MambaByte in enabling token-free language modeling.
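The byte-level setting referred to above can be seen in a few lines: the vocabulary is simply the 256 possible byte values, so no tokenizer is learned, at the cost of much longer sequences. The snippet below is a generic illustration, unrelated to MambaByte's actual training pipeline.

```python
# Byte-level "tokenization": the model's vocabulary is just the 256 byte values.
text = "Token-free models read raw bytes."

byte_ids = list(text.encode("utf-8"))   # integers in [0, 255]
print(len(text), "characters ->", len(byte_ids), "byte tokens")
print(byte_ids[:10])

# Decoding is lossless: the bytes map back to the original string.
assert bytes(byte_ids).decode("utf-8") == text
```
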


Abstract:Tack et al. (2023) organized the shared task hosted by the 18th Workshop on Innovative Use of NLP for Building Educational Applications on the generation of teacher language in educational dialogues. Following the structure of the shared task, in this study we assess the generative abilities of large language models in providing informative and helpful insights to students, thereby simulating the role of a knowledgeable teacher. To this end, we present an extensive evaluation of several benchmark generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT. Additionally, to optimize for pedagogical quality, we fine-tuned the Flan-T5 model using reinforcement learning. Our experimental findings on a subset of the Teacher-Student Chatroom Corpus indicate the efficacy of GPT-4 over the fine-tuned models, measured using BERTScore and DialogRPT. We hypothesize that several dataset characteristics, including sampling, representativeness, and dialog completeness, pose significant challenges to fine-tuning and thus contribute to the poor generalizability of the fine-tuned models. Finally, we note the need for these generative models to be evaluated with a metric that relies not only on dialog coherence and matched language-modeling distribution but also on the model's ability to showcase pedagogical skills.
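For readers unfamiliar with the automatic metric mentioned above, here is a hedged sketch of computing BERTScore with the open-source bert-score package (pip install bert-score). The candidate and reference strings are invented placeholders, not drawn from the Teacher-Student Chatroom Corpus, and this is not the exact evaluation pipeline used in the study.

```python
from bert_score import score

# Hypothetical teacher responses: model output vs. reference teacher turn.
candidates = ["Good try! Remember that the past tense of 'go' is 'went'."]
references = ["Almost. 'Go' is irregular, so the past tense is 'went'."]

# Returns per-example precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```
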