Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Payal Bajaj

Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

May 21, 2023

Linyuan Gong, Chenyan Xiong, Xiaodong Liu, Payal Bajaj, Yiqing Xie, Alvin Cheung, Jianfeng Gao, Xia Song

Figure 1 for Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

Figure 2 for Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

Figure 3 for Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

Figure 4 for Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

Abstract:This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5. We study various designs to pretrain T5 using an auxiliary model to construct more challenging token replacements for the main model to denoise. Key aspects under study include the decoding target, the location of the RTD head, and the masking pattern. Based on these studies, we develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-Style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks. METRO-T0 outperforms all similar-sized baselines on prompted NLP benchmarks, such as T0 Eval and MMLU, and rivals the state-of-the-art T0-11B model with only 8% of its parameters. Our analysis on model's neural activation and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from more balanced contribution of parameters and better utilization of their capacity. The code and model checkpoints are available at https://github.com/gonglinyuan/metro_t0.

* Published as a conference paper at ACL 2023. 9 pages

Via

Access Paper or Ask Questions

Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Jan 27, 2023

Jessica Huynh, Cathy Jiao, Prakhar Gupta, Shikib Mehri, Payal Bajaj, Vishrav Chaudhary, Maxine Eskenazi

Figure 1 for Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Figure 2 for Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Figure 3 for Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Figure 4 for Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Abstract:Language models have steadily increased in size over the past few years. They achieve a high level of performance on various natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) have been used for generation and can now output human-like text. Due to this, there are other downstream tasks in the realm of dialog that can now harness the LLMs' language understanding capabilities. Dialog evaluation is one task that this paper will explore. It concentrates on prompting with LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial and TNLGv2. The paper shows that the choice of datasets used for training a model contributes to how well it performs on a task as well as on how the prompt should be structured. Specifically, the more diverse and relevant the group of datasets that a model is trained on, the better dialog evaluation performs. This paper also investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.

* Accepted for publication at IWSDS 2023

Via

Access Paper or Ask Questions

Foundation Transformers

Oct 19, 2022

Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim(+5 more)

Abstract:A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).

* Work in progress

Via

Access Paper or Ask Questions

On the Representation Collapse of Sparse Mixture of Experts

Apr 20, 2022

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Furu Wei

Figure 1 for On the Representation Collapse of Sparse Mixture of Experts

Figure 2 for On the Representation Collapse of Sparse Mixture of Experts

Figure 3 for On the Representation Collapse of Sparse Mixture of Experts

Figure 4 for On the Representation Collapse of Sparse Mixture of Experts

Abstract:Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.

Via

Access Paper or Ask Questions

METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

Apr 16, 2022

Payal Bajaj, Chenyan Xiong, Guolin Ke, Xiaodong Liu, Di He, Saurabh Tiwary, Tie-Yan Liu, Paul Bennett, Xia Song, Jianfeng Gao

Figure 1 for METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

Figure 2 for METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

Figure 3 for METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

Figure 4 for METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

Abstract:We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model. Originated in ELECTRA, this training strategy has demonstrated sample-efficiency to pretrain models at the scale of hundreds of millions of parameters. In this work, we conduct a comprehensive empirical study, and propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO), which incorporates some of the best modeling techniques developed recently to speed up, stabilize, and enhance pretrained language models without compromising model effectiveness. The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks. More importantly, METRO-LM are efficient in that they often outperform previous large models with significantly smaller model sizes and lower pretraining cost.

* Update details in scaled initialization and add acknowledgement

Via

Access Paper or Ask Questions

Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators

Apr 07, 2022

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, Xia Song

Figure 1 for Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators

Figure 2 for Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators

Figure 3 for Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators

Figure 4 for Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators

Abstract:We present a new framework AMOS that pretrains text encoders with an Adversarial learning curriculum via a Mixture Of Signals from multiple auxiliary generators. Following ELECTRA-style pretraining, the main encoder is trained as a discriminator to detect replaced tokens generated by auxiliary masked language models (MLMs). Different from ELECTRA which trains one MLM as the generator, we jointly train multiple MLMs of different sizes to provide training signals at various levels of difficulty. To push the discriminator to learn better with challenging replaced tokens, we learn mixture weights over the auxiliary MLMs' outputs to maximize the discriminator loss by backpropagating the gradient from the discriminator via Gumbel-Softmax. For better pretraining efficiency, we propose a way to assemble multiple MLMs into one unified auxiliary model. AMOS outperforms ELECTRA and recent state-of-the-art pretrained models by about 1 point on the GLUE benchmark for BERT base-sized models.

* ICLR 2022. (Code and Models: https://github.com/microsoft/AMOS)

Via

Access Paper or Ask Questions

XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

Jun 30, 2021

Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Saksham Singhal, Payal Bajaj, Xia Song, Furu Wei

Figure 1 for XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

Figure 2 for XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

Figure 3 for XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

Figure 4 for XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

Abstract:In this paper, we introduce ELECTRA-style tasks to cross-lingual language model pre-training. Specifically, we present two pre-training tasks, namely multilingual replaced token detection, and translation replaced token detection. Besides, we pretrain the model, named as XLM-E, on both multilingual and parallel corpora. Our model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost. Moreover, analysis shows that XLM-E tends to obtain better cross-lingual transferability.

Via

Access Paper or Ask Questions

Language Scaling for Universal Suggested Replies Model

Jun 04, 2021

Qianlan Ying, Payal Bajaj, Budhaditya Deb, Yu Yang, Wei Wang, Bojia Lin, Milad Shokouhi, Xia Song, Yang Yang, Daxin Jiang

Figure 1 for Language Scaling for Universal Suggested Replies Model

Figure 2 for Language Scaling for Universal Suggested Replies Model

Figure 3 for Language Scaling for Universal Suggested Replies Model

Figure 4 for Language Scaling for Universal Suggested Replies Model

Abstract:We consider the problem of scaling automated suggested replies for Outlook email system to multiple languages. Faced with increased compute requirements and low resources for language expansion, we build a single universal model for improving the quality and reducing run-time costs of our production system. However, restricted data movement across regional centers prevents joint training across languages. To this end, we propose a multi-task continual learning framework, with auxiliary tasks and language adapters to learn universal language representation across regions. The experimental results show positive cross-lingual transfer across languages while reducing catastrophic forgetting across regions. Our online results on real user traffic show significant gains in CTR and characters saved, as well as 65% training cost reduction compared with per-language models. As a consequence, we have scaled the feature in multiple languages including low-resource markets.

Via

Access Paper or Ask Questions

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Feb 16, 2021

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, Xia Song

Figure 1 for COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Figure 2 for COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Figure 3 for COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Figure 4 for COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Abstract:We present COCO-LM, a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences. COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences. It creates more challenging pretraining inputs, where noises are sampled based on their likelihood in the auxiliary language model. COCO-LM then pretrains with two tasks: The first task, corrective language modeling, learns to correct the auxiliary model's corruptions by recovering the original tokens. The second task, sequence contrastive learning, ensures that the language model generates sequence representations that are invariant to noises and transformations. In our experiments on the GLUE and SQuAD benchmarks, COCO-LM outperforms recent pretraining approaches in various pretraining settings and few-shot evaluations, with higher pretraining efficiency. Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.

Via

Access Paper or Ask Questions

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Oct 31, 2018

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen(+5 more)

Figure 1 for MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Figure 2 for MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Figure 3 for MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Figure 4 for MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Abstract:We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.

Via

Access Paper or Ask Questions