Mayank Kulkarni

MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

Feb 04, 2025

Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu

Abstract: The recent success of specialized Large Language Models (LLMs) in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.
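
The unweighted averaging baseline that the abstract criticizes can be illustrated with a toy sketch. This is not the paper's method: the function name average_merge and the three-element weight vectors are hypothetical, chosen only to show how updates of opposite sign cancel when averaged, which is one simple form of parameter interference.

```python
def average_merge(expert_weights):
    """Element-wise unweighted mean of equally shaped weight vectors."""
    n = len(expert_weights)
    return [sum(values) / n for values in zip(*expert_weights)]


# Hypothetical toy weights: both experts were fine-tuned from the same base,
# but they moved the first and last parameters in opposite directions.
base = [1.0, 1.0, 1.0]
math_expert = [1.5, 1.0, 0.5]
code_expert = [0.5, 1.0, 1.5]

merged = average_merge([math_expert, code_expert])

# The opposing updates cancel, so the merged weights fall back to the base
# and neither expert's specialization survives the merge.
print(merged == base)
```

In this toy case the average recovers the base weights exactly, which suggests why a merged model built this way may need further fine-tuning to restore expert performance.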

* Accepted by NAACL 2025 Main

MAML-en-LLM: Model Agnostic Meta-Training of LLMs for Improved In-Context Learning

May 19, 2024

Sanchit Sinha, Yuguang Yue, Victor Soto, Mayank Kulkarni, Jianhua Lu, Aidong Zhang

Abstract: Adapting large language models (LLMs) to unseen tasks with in-context training samples without fine-tuning remains an important research problem. To learn a robust LLM that adapts well to unseen tasks, multiple meta-training approaches have been proposed, such as MetaICL and MetaICT, which involve meta-training pre-trained LLMs on a wide variety of diverse tasks. These meta-training approaches essentially perform in-context multi-task fine-tuning and evaluate on a disjoint test set of tasks. Even though they achieve impressive performance, their goal is never to compute a truly general set of parameters. In this paper, we propose MAML-en-LLM, a novel method for meta-training LLMs, which can learn truly generalizable parameters that not only perform well on disjoint tasks but also adapt to unseen tasks. We observe an average performance increase of 2% on unseen domains and a 4% improvement in adaptation performance. Furthermore, we demonstrate that MAML-en-LLM outperforms baselines in settings with a limited amount of training data on both seen and unseen domains by an average of 2%. Finally, we discuss the effects of task type, optimizer, and task complexity, an avenue barely explored in the meta-training literature. Exhaustive experiments across 7 task settings and two data settings demonstrate that models trained with MAML-en-LLM outperform SOTA meta-training approaches.
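
For readers unfamiliar with MAML, the general idea of optimizing parameters so that they adapt well after a gradient step can be sketched on a toy problem. This is a generic illustration, not the MAML-en-LLM training procedure: the one-dimensional quadratic tasks, the function name maml_meta_train, and all hyperparameters are assumptions made for the example.

```python
def maml_meta_train(centers, theta=0.0, inner_lr=0.1, outer_lr=0.1, steps=200):
    """Toy MAML on tasks with loss L_c(theta) = (theta - c) ** 2.

    Inner loop: one gradient step per task from the shared parameter theta.
    Outer loop: update theta using the gradient of the post-adaptation loss.
    """
    for _ in range(steps):
        meta_grad = 0.0
        for c in centers:
            # Inner adaptation step on this task.
            adapted = theta - inner_lr * 2.0 * (theta - c)
            # Gradient of L_c(adapted) with respect to theta; the factor
            # (1 - 2 * inner_lr) is d(adapted)/d(theta) from the chain rule.
            meta_grad += 2.0 * (adapted - c) * (1.0 - 2.0 * inner_lr)
        theta -= outer_lr * meta_grad / len(centers)
    return theta


# Two hypothetical tasks whose optima sit at 1.0 and 3.0.
theta_star = maml_meta_train([1.0, 3.0])
print(round(theta_star, 6))
```

On these symmetric toy tasks the meta-learned parameter settles midway between the two task optima, the starting point from which a single inner step helps both tasks equally.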

* KDD 2024, 11 pages (9 main, 2 ref, 1 App). OpenReview: https://openreview.net/forum?id=JwecLNhWDy&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DKDD.org%2F2024%2FResearch_Track%2FAuthors%23your-submissions)

Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

May 25, 2023

Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni, Daniel Preotiuc-Pietro

Abstract: Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize it, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.

* ACL 2023 Findings

EntSUM: A Data Set for Entity-Centric Summarization

Apr 05, 2022

Mounica Maddela, Mayank Kulkarni, Daniel Preotiuc-Pietro

Abstract: Controllable summarization aims to provide summaries that take into account user-specified aspects and preferences to better serve users' information needs, as opposed to the standard summarization setup, which builds a single generic summary of a document. We introduce a human-annotated data set EntSUM for controllable summarization with a focus on named entities as the aspects to control. We conduct an extensive quantitative analysis to motivate the task of entity-centric summarization and show that existing methods for controllable summarization fail to generate entity-centric summaries. We propose extensions to state-of-the-art summarization approaches that achieve substantially better results on our data set. Our analysis and results show the challenging nature of this task and of the proposed data set.

* Accepted at ACL 2022

Learning Rich Representation of Keyphrases from Text

Dec 16, 2021

Mayank Kulkarni, Debanjan Mahata, Ravneet Arora, Rajarshi Bhowmik

Abstract: In this work, we explore how to learn task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), showing large gains in performance (up to 9.26 points in F1) over SOTA when an LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format, instead of the denoised original input. This also led to gains in performance (up to 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization and achieve performance comparable to SOTA, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.
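
The CatSeq target format mentioned in the abstract can be illustrated with a small sketch. It assumes that the keyphrases of a document are concatenated into a single target string with a separator; the " ; " separator, the case-insensitive de-duplication, and the function name to_catseq are illustrative choices, not details taken from the paper.

```python
def to_catseq(keyphrases, sep=" ; "):
    """Join keyphrases into one target string, dropping blanks and repeats."""
    seen = set()
    unique = []
    for phrase in keyphrases:
        cleaned = phrase.strip()
        key = cleaned.lower()  # compare case-insensitively
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return sep.join(unique)


# Hypothetical keyphrases for one document; the third repeats the first.
target = to_catseq(
    ["keyphrase extraction", "language models", "Keyphrase extraction"]
)
print(target)
```

A generative model trained on such targets emits all keyphrases as one sequence, which is then split on the separator to recover the individual phrases.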

Keyphrase Extraction from Scholarly Articles as Sequence Labeling using Contextualized Embeddings

Oct 19, 2019

Dhruva Sahrawat, Debanjan Mahata, Mayank Kulkarni, Haimin Zhang, Rakesh Gosangi, Amanda Stent, Agniv Sharma, Yaman Kumar, Rajiv Ratn Shah, Roger Zimmermann

Abstract: In this paper, we formulate keyphrase extraction from scholarly articles as a sequence labeling task solved using a BiLSTM-CRF, where the words in the input text are represented using deep contextualized embeddings. We evaluate the proposed architecture using both contextualized and fixed word embedding models on three different benchmark datasets (Inspec, SemEval 2010, SemEval 2017) and compare with existing popular unsupervised and supervised techniques. Our results quantify the benefits of (a) using contextualized embeddings (e.g. BERT) over fixed word embeddings (e.g. GloVe); (b) using a BiLSTM-CRF architecture with contextualized word embeddings over fine-tuning the contextualized word embedding model directly; and (c) using genre-specific contextualized embeddings (SciBERT). Through error analysis, we also provide some insights into why particular models work better than others. Lastly, we present a case study where we analyze different self-attention layers of the two best models (BERT and SciBERT) to better understand the predictions made by each for the task of keyphrase extraction.
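
Casting keyphrase extraction as sequence labeling means assigning one tag per token. The sketch below shows one common way to derive such labels from gold keyphrases; the B/I/O tag names, the whitespace tokenization, and the function name bio_tags are assumptions for illustration and may differ from the scheme used in the paper.

```python
def bio_tags(tokens, keyphrases):
    """Label each token B (begins a keyphrase), I (inside one), or O (outside)."""
    tags = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    for phrase in keyphrases:
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            # Tag a match only if none of its tokens is already labeled.
            if lowered[i:i + n] == words and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B"
                tags[i + 1:i + n] = ["I"] * (n - 1)
    return tags


# Hypothetical sentence and gold keyphrases.
tokens = "We study keyphrase extraction with contextualized embeddings".split()
tags = bio_tags(tokens, ["keyphrase extraction", "contextualized embeddings"])
print(list(zip(tokens, tags)))
```

A tagger such as a BiLSTM-CRF is then trained to predict these per-token tags, and contiguous B/I runs in its output are read back as extracted keyphrases.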
