Anna Rumshisky

Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models

May 22, 2025

Vijeta Deshpande, Debasmita Ghose, John D. Patterson, Roger Beaty, Anna Rumshisky

Abstract: Diverse language model responses are crucial for creative generation, open-ended tasks, and self-improvement training. We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs, limiting expressiveness. To address this, we introduce Diverse, not Short (Diverse-NS), a length-controlled self-learning framework that improves response diversity while maintaining length parity. By generating and filtering preference data that balances diversity, quality, and length, Diverse-NS enables effective training using only 3,000 preference pairs. Applied to LLaMA-3.1-8B and the Olmo-2 family, Diverse-NS substantially enhances lexical and semantic diversity. We show consistent improvement in diversity with minor reduction or gains in response quality on four creative generation tasks: Divergent Associations, Persona Generation, Alternate Uses, and Creative Writing. Surprisingly, experiments with the Olmo-2 model family (7B and 13B) show that smaller models like Olmo-2-7B can serve as effective "diversity teachers" for larger models. By explicitly addressing length bias, our method efficiently pushes models toward more diverse and expressive outputs.
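
The length bias mentioned in the abstract can be illustrated with a toy metric. The sketch below uses type-token ratio (unique tokens divided by total tokens), which is an assumption chosen for illustration; the abstract does not name the specific metrics studied. The vocabulary and both responses are made up.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (a simple lexical diversity score)."""
    return len(set(tokens)) / len(tokens)


# Hypothetical 10-word vocabulary; the "long" response keeps reusing it.
vocab = ["river", "lantern", "orbit", "velvet", "cipher",
         "meadow", "anchor", "ember", "mosaic", "tundra"]
long_response = [vocab[i % len(vocab)] for i in range(40)]
short_response = long_response[:10]  # the same text, truncated

ttr_short = type_token_ratio(short_response)
ttr_long = type_token_ratio(long_response)
print(f"short: {ttr_short:.2f}, long: {ttr_long:.2f}")
```

Over a fixed vocabulary this score can only fall as the response grows, so ranking responses by it alone would favor the truncated one. Comparing responses of similar length, as a length-controlled setup does, removes that shortcut.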

MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

Feb 04, 2025

Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu

Abstract: The recent success of specialized Large Language Models (LLMs) in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.

* Accepted by NAACL 2025 Main

Emergent Abilities in Reduced-Scale Generative Language Models

Apr 02, 2024

Sherin Muckatira, Vijeta Deshpande, Vladislav Lialin, Anna Rumshisky

Abstract: Large language models can solve new tasks without task-specific fine-tuning. This ability, also known as in-context learning (ICL), is considered an emergent ability and is primarily seen in large language models with billions of parameters. This study investigates whether such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify pre-training data and pre-train 36 causal language models ranging in size from 1 million to 165 million parameters. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models with limited size. Additionally, we find that these smaller models pre-trained on simplified data demonstrate a power law relationship between the evaluation loss and the three scaling factors: compute, dataset size, and model size.

* 16 pages, 4 figures. Accepted to NAACL 2024 Findings
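
A power law relationship of the kind the abstract mentions, loss ≈ a · N^(−b), becomes a straight line in log-log space, so it can be fit with ordinary least squares on the logarithms. The sketch below is a minimal illustration under stated assumptions: the losses are synthetic values generated from a made-up law (a = 10, b = 0.1), not results from the paper, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
             / sum((lx - mean_x) ** 2 for lx in log_x))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Model sizes spanning the 1M to 165M range named in the abstract.
model_sizes = [1e6, 5e6, 2e7, 8e7, 1.65e8]
# Synthetic losses from an assumed law, for illustration only.
losses = [10.0 * n ** -0.1 for n in model_sizes]

a_hat, b_hat = fit_power_law(model_sizes, losses)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

The same fit applies to compute or dataset size on the x-axis; with real measurements the residuals in log space indicate how well a single power law describes the data.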

Deconstructing In-Context Learning: Understanding Prompts via Corruption

Apr 02, 2024

Namrata Shivagunde, Vladislav Lialin, Sherin Muckatira, Anna Rumshisky

Abstract: The ability of large language models (LLMs) to "learn in context" based on the provided prompt has led to an explosive growth in their use, culminating in the proliferation of AI assistants such as ChatGPT, Claude, and Bard. These AI assistants are known to be robust to minor prompt modifications, mostly due to alignment techniques that use human feedback. In contrast, the underlying pre-trained LLMs they use as a backbone are known to be brittle in this respect. Building high-quality backbone models remains a core challenge, and a common approach to assessing their quality is to conduct few-shot evaluation. Such evaluation is notorious for being highly sensitive to minor prompt modifications, as well as the choice of specific in-context examples. Prior work has examined how modifying different elements of the prompt can affect model performance. However, these earlier studies tended to concentrate on a limited number of specific prompt attributes and often produced contradictory results. Additionally, previous research either focused on models with fewer than 15 billion parameters or exclusively examined black-box models like GPT-3 or PaLM, making replication challenging. In the present study, we decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions provided for each demonstration. We investigate the effects of structural and semantic corruptions of these elements on model performance. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models ($\geq$30B) are more sensitive to the semantics of the prompt. Finally, we observe that adding task and inline instructions to the demonstrations enhances model performance even when the instructions are semantically corrupted.

* Accepted to LREC-COLING 2024 main conference. The code is available at https://github.com/text-machine-lab/Understanding_prompts_via_corruption

Prompt Perturbation Consistency Learning for Robust Language Models

Feb 24, 2024

Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, Aram Galstyan

Abstract: Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), which is a central component in personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on the robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models due to three distinct yet relevant types of input perturbations: oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments demonstrate that PPCL can recover on average 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats the data augmentation approach while using ten times fewer augmented data samples.

Let's Reinforce Step by Step

Nov 10, 2023

Sarah Pan, Vladislav Lialin, Sherin Muckatira, Anna Rumshisky

Abstract: While recent advances have boosted LM proficiency in linguistic benchmarks, LMs consistently struggle to reason correctly on complex tasks like mathematics. We turn to Reinforcement Learning from Human Feedback (RLHF) as a method with which to shape model reasoning processes. In particular, we explore two reward schemes, outcome-supervised reward models (ORMs) and process-supervised reward models (PRMs), to optimize for logical reasoning. Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning (GSM8K) while, unexpectedly, reducing performance in complex tasks (MATH). Furthermore, we show the critical role reward aggregation functions play in model performance. Providing promising avenues for future research, our study underscores the need for further exploration into fine-grained reward modeling for more reliable language models.

* NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

Jul 13, 2023

Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky

Abstract: Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparametrized models remains poorly understood, and alternative approaches do not necessarily make it cheaper to train high-performance models. In this paper, we explore low-rank training techniques as an alternative approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate comparable performance to regular neural network training. Furthermore, we observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
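
The idea that low-rank updates can add up to a high-rank result rests on a simple linear algebra fact: each product B·A with inner dimension r has rank at most r, but a sum of several such products can have rank up to the number of updates times r. The sketch below illustrates only that fact; it is not the ReLoRA training procedure, which the abstract does not describe. The matrices are hand-picked integers, and `matmul`, `add`, and `rank` are hypothetical helpers written for this example.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank(m):
    """Exact rank via Gaussian elimination over rationals."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


d = 4
W = [[0] * d for _ in range(d)]        # accumulated weight change
update_ranks = []
for x in (1, 2, 3):                    # three successive rank-1 updates
    B = [[x ** j] for j in range(d)]   # d x 1 factor
    A = [[x ** j for j in range(d)]]   # 1 x d factor
    delta = matmul(B, A)               # rank-1 update B @ A
    update_ranks.append(rank(delta))
    W = add(W, delta)                  # merge the update into W

final_rank = rank(W)
print(update_ranks, final_rank)
```

Each individual update has rank 1, while the accumulated matrix has rank 3, because the three factor vectors are linearly independent.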

Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models

Jun 14, 2023

Saleh Soltan, Andy Rosenbaum, Tobias Falke, Qin Lu, Anna Rumshisky, Wael Hamza

Abstract: Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages; however, training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it under-performs a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.

* ACL Findings 2023 and SustaiNLP Workshop 2023

Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale

May 30, 2023

Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, Anna Rumshisky

Abstract: In recent years, language models have drastically grown in size, and the abilities of these models have been shown to improve with scale. The majority of recent scaling-law studies have focused on high-compute, high-parameter-count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced, modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). We examine downscaling effects, extending scaling laws to models as small as ~1M parameters. At this scale, we observe a break of the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute cost below $2.2 \times 10^{15}$ FLOPs. We also find that adding layers does not always benefit downstream performance.

* Accepted to ACL 2023 Findings

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Apr 04, 2023

Vladislav Lialin, Stephen Rawls, David Chan, Shalini Ghosh, Anna Rumshisky, Wael Hamza

Abstract: Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. The currently popular approach to video-text data mining, automatic speech recognition (ASR) as used in HowTo100M, provides low-quality captions that often do not refer to the video content. Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone. We fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDEr on MSR-VTT) than pre-training on a single modality. Our methods are complementary to the existing pre-training or data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we are planning to publicly release the generated captions.

* 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)
