Kishore Papineni

Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines

Jul 22, 2024

Yuchen Li, Alexandre Kirchmeyer, Aashay Mehta, Yilong Qin, Boris Dadachev, Kishore Papineni, Sanjiv Kumar, Andrej Risteski

Abstract: Autoregressive language models are the currently dominant paradigm for text generation, but they have some fundamental limitations that cannot be remedied by scale, for example inherently sequential and unidirectional generation. While alternate classes of models have been explored, we have limited mathematical understanding of their fundamental power and limitations. In this paper we focus on Generative Masked Language Models (GMLMs), a non-autoregressive paradigm in which we train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov chain to draw samples from the model. These models empirically strike a promising speed-quality trade-off, as each step can typically decode the entire sequence in parallel. We develop a mathematical framework for analyzing and improving such models, which sheds light on questions of sample complexity and inference speed and quality. Empirically, we adapt the T5 model for iteratively-refined parallel decoding, achieving a 2-3x speedup in machine translation with minimal sacrifice in quality compared with autoregressive models. We run careful ablation experiments to give recommendations on key design choices, and make fine-grained observations on the common error modes in connection with our theory. Our mathematical analyses and empirical observations characterize both the potential and the limitations of this approach, and can be applied to future work on improving the understanding and performance of GMLMs. Our code is released at https://github.com/google-research/google-research/tree/master/padir

* ICML 2024
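
The abstract above describes fitting conditional probabilities by masking and then feeding them to a Markov chain to draw samples. As a rough illustration only, the sketch below runs a single-site Gibbs-style chain over a toy vocabulary. It is not the paper's method or the released padir code: `toy_conditional`, `gibbs_sample`, the three-token vocabulary, and the "agree with the left neighbor" rule are all hypothetical stand-ins for a trained masked model, and the chain resamples one position per step rather than decoding the whole sequence in parallel.

```python
import random
from typing import Callable, List


def toy_conditional(x: List[int], i: int) -> List[float]:
    """Hypothetical stand-in for a learned p(x_i | all other positions).

    Vocabulary is {0, 1, 2}; the token equal to the left neighbor
    (if there is one) gets extra weight.
    """
    weights = [1.0, 1.0, 1.0]
    if i > 0:
        weights[x[i - 1]] += 4.0
    total = sum(weights)
    return [w / total for w in weights]


def gibbs_sample(
    init: List[int],
    conditional: Callable[[List[int], int], List[float]],
    steps: int,
    rng: random.Random,
) -> List[int]:
    """Run a single-site Markov chain driven by per-position conditionals."""
    x = list(init)
    for _ in range(steps):
        i = rng.randrange(len(x))      # pick a position to "mask"
        probs = conditional(x, i)      # conditional given the rest of x
        x[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return x


sample = gibbs_sample([0, 1, 2, 0, 1], toy_conditional, steps=200,
                      rng=random.Random(0))
print(sample)
```

A trained model would replace `toy_conditional`, and a parallel variant would resample several masked positions within one step.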

Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts

Apr 14, 2024

Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton

Abstract: Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. (2018) as a way to improve inference speed of language models. In this paper, we make two contributions to understanding and improving BPD drafts. First, we offer an analysis of the token distributions produced by the BPD prediction heads. Second, we use this analysis to inform algorithms that improve BPD inference speed by refining the BPD drafts using small n-gram or neural language models. We empirically show that these refined BPD drafts yield a higher average verified prefix length across tasks.
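
The "verified prefix length" mentioned in the abstract can be illustrated with a small sketch. The assumption here is the usual draft-then-verify reading of blockwise parallel decoding: a draft token is accepted only while it matches the base model's own greedy next-token choice. The names `verified_prefix_length` and `toy_greedy_next`, and the "previous token plus one" toy model, are hypothetical and not taken from the paper. A real system would check all draft positions in one parallel forward pass; the loop below is sequential only for readability.

```python
from typing import Callable, List, Sequence


def verified_prefix_length(
    context: Sequence[int],
    draft: Sequence[int],
    greedy_next: Callable[[Sequence[int]], int],
) -> int:
    """Count how many leading draft tokens match the base model's greedy choice."""
    prefix: List[int] = list(context)
    accepted = 0
    for token in draft:
        if greedy_next(prefix) != token:
            break                      # first mismatch ends the verified prefix
        prefix.append(token)
        accepted += 1
    return accepted


def toy_greedy_next(prefix: Sequence[int]) -> int:
    """Hypothetical base model: always predicts (last token + 1) mod 10."""
    return (prefix[-1] + 1) % 10


# Draft [4, 5, 7, 8] after context [1, 2, 3]: 4 and 5 match, 7 does not.
print(verified_prefix_length([1, 2, 3], [4, 5, 7, 8], toy_greedy_next))  # prints 2
```

Averaging this count over many decoding steps gives one way to compare drafts: a longer average verified prefix means fewer verification passes per generated token.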

Balancing Robustness and Sensitivity using Feature Contrastive Learning

May 19, 2021

Seungyeon Kim, Daniel Glasner, Srikumar Ramalingam, Cho-Jui Hsieh, Kishore Papineni, Sanjiv Kumar

Abstract: It is generally believed that robust training of extremely large networks is critical to their success in real-world applications. However, when taken to the extreme, methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns. In this paper, we discuss this trade-off between sensitivity and robustness to natural (non-adversarial) perturbations by introducing two notions: contextual feature utility and contextual feature sensitivity. We propose Feature Contrastive Learning (FCL) that encourages a model to be more sensitive to the features that have higher contextual utility. Empirical results demonstrate that models trained with FCL achieve a better balance of robustness and sensitivity, leading to improved generalization in the presence of noise on both vision and NLP datasets.

* 31 pages, 5 figures, 3 tables

Text Segmentation by Cross Segment Attention

Apr 30, 2020

Michal Lukasik, Boris Dadachev, Gonçalo Simões, Kishore Papineni

Abstract: Document and discourse segmentation are two fundamental NLP tasks pertaining to breaking up text into constituents, which are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we propose three transformer-based architectures and provide comprehensive comparisons with previously proposed approaches on three standard datasets. We establish a new state of the art, in particular reducing error rates by a large margin in all cases. We further analyze model sizes and find that we can build models with many fewer parameters while keeping good performance, thus facilitating real-world applications.

* 10 pages, 4 figures
