Nan Ding

Purdue University

CausalLM is not optimal for in-context learning

Sep 03, 2023

Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut

Abstract: Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
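
The difference between the two attention patterns described in the abstract above can be sketched with boolean masks. This is only an illustration in plain Python, not the paper's code; the function names and the `prefix_len` parameter are assumptions introduced here.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Causal (auto-regressive) attention: a position sees itself and
    earlier positions only.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Prefix-LM attention (hypothetical helper for illustration).

    The first `prefix_len` positions (the in-context samples) all attend
    to each other; positions after the prefix stay causal.
    """
    return [[(j < prefix_len) or (j <= i) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for name, mask in (("causal", causal_mask(4)), ("prefix", prefix_mask(4, 3))):
        print(name)
        for row in mask:
            print(" ".join("1" if allowed else "0" for allowed in row))
```

With a prefix of length 3 out of 4 positions, the first in-context sample can attend to the second and third under the prefix mask, while under the causal mask it can attend only to itself.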

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

Nov 22, 2022

Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut

Abstract: Recent research in robust optimization has shown an overfitting-like phenomenon in which models trained against adversarial attacks exhibit higher robustness on the training set compared to the test set. Although previous work provided theoretical explanations for this phenomenon using a robust PAC-Bayesian bound over the adversarial test error, related algorithmic derivations are at best only loosely connected to this bound, which implies that there is still a gap between their empirical success and our understanding of adversarial robustness theory. To close this gap, in this paper we consider a different form of the robust PAC-Bayesian bound and directly minimize it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures the surface flatness. In practice, we restrict the TrH regularizer to the top layer only, which results in an analytical solution to the bound whose computational cost does not depend on the network depth. Finally, we evaluate our TrH regularization approach over CIFAR-10/100 and ImageNet using Vision Transformers (ViT) and compare against baseline adversarial robustness algorithms. Experimental results show that TrH regularization leads to improved ViT robustness that either matches or surpasses previous state-of-the-art approaches while at the same time requiring less memory and computational cost.
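
To make concrete what a Trace of Hessian quantity measures, here is a minimal sketch that estimates it for a small scalar function with central finite differences. It is not the paper's analytical top-layer solution; the function name `hessian_trace`, the step size `eps`, and the toy loss are assumptions chosen for illustration.

```python
def hessian_trace(f, w, eps=1e-3):
    """Estimate the trace of the Hessian of f at w.

    Sums the second derivative along each coordinate, each estimated
    with a central finite difference. A larger value indicates a
    sharper (less flat) surface around w.
    """
    base = f(w)
    total = 0.0
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        total += (f(up) - 2.0 * base + f(down)) / (eps * eps)
    return total


def toy_loss(w):
    # Hypothetical quadratic loss with Hessian diag(6, 1), so the trace is 7.
    return 3.0 * w[0] ** 2 + 0.5 * w[1] ** 2


if __name__ == "__main__":
    print(hessian_trace(toy_loss, [1.0, 2.0]))
```

The finite-difference loop costs two function evaluations per parameter, which is only practical for tiny examples like this one.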

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Sep 16, 2022

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer(+19 more)

Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

All You May Need for VQA are Image Captions

May 04, 2022

Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut

Abstract: Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.

* 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022)

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks

Mar 10, 2022

Nan Ding, Xi Chen, Tomer Levinboim, Beer Changpinyo, Radu Soricut

Abstract: With the increasing abundance of pretrained models in recent years, the problem of selecting the best pretrained checkpoint for a particular downstream classification task has been gaining increased attention. Although several methods have recently been proposed to tackle the selection problem (e.g. LEEP, H-score), these methods resort to applying heuristics that are not well motivated by learning theory. In this paper we present PACTran, a theoretically grounded family of metrics for pretrained model selection and transferability measurement. We first show how to derive PACTran metrics from the optimal PAC-Bayesian bound under the transfer learning setting. We then empirically evaluate three metric instantiations of PACTran on a number of vision tasks (VTAB) as well as a language-and-vision (OKVQA) task. An analysis of the results shows PACTran is a more consistent and effective transferability measure compared to existing selection methods.

Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning

May 28, 2021

Nan Ding, Xi Chen, Tomer Levinboim, Sebastian Goodman, Radu Soricut

Abstract: Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing PAC-Bayesian theories on meta-learning to explain performance improvements in the few-shot learning setting, where the number of training examples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of training examples in the observed tasks and the number of training examples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption, we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theories. Furthermore, we derive a new computationally-efficient PACMAML algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.

Do Transformer Modifications Transfer Across Implementations and Applications?

Feb 23, 2021

Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan(+6 more)

Abstract: The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results.

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Feb 17, 2021

Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut

Abstract: The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements, inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset, as well as benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. The quantitative and qualitative results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance

Oct 13, 2020

Xi Chen, Nan Ding, Tomer Levinboim, Radu Soricut

Abstract: Recent advances in automatic evaluation metrics for text have shown that deep contextualized word representations, such as those generated by BERT encoders, are helpful for designing metrics that correlate well with human judgements. At the same time, it has been argued that contextualized word representations exhibit sub-optimal statistical properties for encoding the true similarity between words or sentences. In this paper, we present two techniques for improving encoding representations for similarity metrics: a batch-mean centering strategy that improves statistical properties; and a computationally efficient tempered Word Mover Distance, for better fusion of the information in the contextualized word representations. We conduct numerical experiments that demonstrate the robustness of our techniques, reporting results over various BERT-backbone learned metrics and achieving state-of-the-art correlation with human ratings on several benchmarks.

* EMNLP 2020 Eval4NLP Workshop
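
The batch-mean centering idea named in the abstract above can be illustrated with a generic mean-subtraction step. The abstract does not spell out the exact procedure, so this is a sketch under the assumption that centering means subtracting the per-dimension mean of the batch from every embedding; `batch_center` is a hypothetical name.

```python
def batch_center(vectors):
    """Subtract the per-dimension batch mean from every vector.

    `vectors` is a non-empty list of equal-length lists of floats,
    standing in for a batch of word embeddings.
    """
    n = len(vectors)
    dim = len(vectors[0])
    # Mean of each embedding dimension across the batch.
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]


if __name__ == "__main__":
    batch = [[1.0, 2.0], [3.0, 6.0]]
    print(batch_center(batch))
```

After centering, each dimension of the batch sums to zero, which removes any offset shared by all embeddings before similarities are computed.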

TeaForN: Teacher-Forcing with N-grams

Oct 09, 2020

Sebastian Goodman, Nan Ding, Radu Soricut

Abstract: Sequence generation models trained with teacher-forcing suffer from issues related to exposure bias and lack of differentiability across timesteps. Our proposed method, Teacher-Forcing with N-grams (TeaForN), addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model parameter updates based on N prediction steps. TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts generation quality on one Machine Translation benchmark, WMT 2014 English-French, and two News Summarization benchmarks, CNN/Dailymail and Gigaword.

* to be published in EMNLP 2020
