Tomer Levinboim

CausalLM is not optimal for in-context learning

Sep 03, 2023

Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut

Abstract:Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prevents in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, the convergence dynamics of causalLM follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
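
The difference between the two attention patterns described in the abstract can be illustrated with boolean masks. This is a minimal sketch, not code from the paper: the function names `causal_mask` and `prefix_mask` and the parameter `prefix_len` (the number of in-context sample positions) are assumptions made for illustration.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Auto-regressive attention: a position sees only itself and earlier positions.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Prefix attention: every position may attend to all prefix positions.

    Positions after the prefix remain causal among themselves.
    `prefix_len` is a hypothetical parameter naming how many leading
    positions hold the in-context samples.
    """
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]


# Four positions: three in-context samples followed by one query.
causal = causal_mask(4)
prefix = prefix_mask(4, prefix_len=3)
```

Under the causal mask the first in-context sample sees only itself, while under the prefix mask it sees all three in-context samples. The query position sees the same context in both cases.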

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

Nov 22, 2022

Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut

Abstract:Recent research in robust optimization has shown an overfitting-like phenomenon in which models trained against adversarial attacks exhibit higher robustness on the training set compared to the test set. Although previous work provided theoretical explanations for this phenomenon using a robust PAC-Bayesian bound over the adversarial test error, related algorithmic derivations are at best only loosely connected to this bound, which implies that there is still a gap between their empirical success and our understanding of adversarial robustness theory. To close this gap, in this paper we consider a different form of the robust PAC-Bayesian bound and directly minimize it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures the surface flatness. In practice, we restrict the TrH regularizer to the top layer only, which results in an analytical solution to the bound whose computational cost does not depend on the network depth. Finally, we evaluate our TrH regularization approach over CIFAR-10/100 and ImageNet using Vision Transformers (ViT) and compare against baseline adversarial robustness algorithms. Experimental results show that TrH regularization leads to improved ViT robustness that either matches or surpasses previous state-of-the-art approaches while at the same time requires less memory and computational cost.
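
To see why restricting a Hessian-trace penalty to the top layer can be cheap, consider the standard closed form for softmax cross-entropy with logits z = W h: the Hessian with respect to W is (diag(p) - p pᵀ) ⊗ h hᵀ, so its trace is (1 - Σ pᵢ²) · ‖h‖². The sketch below computes that quantity. Whether it matches the paper's exact regularizer is an assumption, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_layer_hessian_trace(logits, features):
    """Trace of the cross-entropy Hessian w.r.t. the top-layer weights W.

    Assumes logits = W @ features with a softmax cross-entropy loss.
    The result depends only on the predicted probabilities and the
    feature norm, not on the layers that produced `features`.
    """
    p = softmax(logits)
    return (1.0 - sum(pi * pi for pi in p)) * sum(h * h for h in features)


uncertain = top_layer_hessian_trace([0.0, 0.0], [1.0, 2.0])
confident = top_layer_hessian_trace([100.0, 0.0], [1.0, 2.0])
```

The trace is largest when the prediction is uniform and approaches zero as the prediction becomes confident, which is consistent with reading it as a flatness measure.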

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks

Mar 10, 2022

Nan Ding, Xi Chen, Tomer Levinboim, Beer Changpinyo, Radu Soricut

Abstract:With the increasing abundance of pretrained models in recent years, the problem of selecting the best pretrained checkpoint for a particular downstream classification task has been gaining increased attention. Although several methods have recently been proposed to tackle the selection problem (e.g. LEEP, H-score), these methods resort to applying heuristics that are not well motivated by learning theory. In this paper we present PACTran, a theoretically grounded family of metrics for pretrained model selection and transferability measurement. We first show how to derive PACTran metrics from the optimal PAC-Bayesian bound under the transfer learning setting. We then empirically evaluate three metric instantiations of PACTran on a number of vision tasks (VTAB) as well as a language-and-vision (OKVQA) task. An analysis of the results shows PACTran is a more consistent and effective transferability measure compared to existing selection methods.

Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning

May 28, 2021

Nan Ding, Xi Chen, Tomer Levinboim, Sebastian Goodman, Radu Soricut

Abstract:Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing PAC-Bayesian theories on meta-learning to explain performance improvements in the few-shot learning setting, where the number of training examples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of training examples in the observed tasks and the number of training examples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption, we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theories. Furthermore, we derive a new computationally-efficient PACMAML algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.
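
For readers unfamiliar with Reptile, one of the two existing algorithms named in the abstract, its update moves the meta-parameters toward the parameters obtained after a few task-specific gradient steps. The sketch below is a toy illustration on one-dimensional quadratic tasks, not the paper's PACMAML algorithm or its bounds. The task loss, the task-averaged (batched) form of the update, and all hyperparameter values are assumptions.

```python
def inner_sgd(theta, target, steps, lr):
    """Run `steps` gradient steps on the toy task loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        theta -= lr * (theta - target)
    return theta


def reptile_step(theta, targets, inner_steps=3, inner_lr=0.1, meta_lr=0.5):
    """One batched Reptile update: adapt to each task, then interpolate.

    Each entry of `targets` defines one toy task; the meta-parameters move
    a fraction `meta_lr` of the way toward the mean adapted parameters.
    """
    adapted = [inner_sgd(theta, t, inner_steps, inner_lr) for t in targets]
    mean_adapted = sum(adapted) / len(adapted)
    return theta + meta_lr * (mean_adapted - theta)


theta = 0.0
for _ in range(200):
    theta = reptile_step(theta, targets=[1.0, 3.0])
```

On these two symmetric toy tasks the meta-parameter settles at the midpoint of the task optima, from which either task is reachable in few inner steps.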

Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance

Oct 13, 2020

Xi Chen, Nan Ding, Tomer Levinboim, Radu Soricut

Abstract:Recent advances in automatic evaluation metrics for text have shown that deep contextualized word representations, such as those generated by BERT encoders, are helpful for designing metrics that correlate well with human judgements. At the same time, it has been argued that contextualized word representations exhibit sub-optimal statistical properties for encoding the true similarity between words or sentences. In this paper, we present two techniques for improving encoding representations for similarity metrics: a batch-mean centering strategy that improves statistical properties; and a computationally efficient tempered Word Mover Distance, for better fusion of the information in the contextualized word representations. We conduct numerical experiments that demonstrate the robustness of our techniques, reporting results over various BERT-backbone learned metrics and achieving state-of-the-art correlation with human ratings on several benchmarks.
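
The intuition behind batch-mean centering can be shown on toy vectors: when all embeddings share a large common component, raw cosine similarities are inflated, and subtracting the batch mean removes that shared offset. This is a minimal sketch assuming that centering means subtracting the per-dimension mean over all vectors in the batch; the paper's exact procedure may differ, and the function names and data are hypothetical.

```python
import math


def batch_center(vectors):
    """Subtract the per-dimension batch mean from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings that share a large offset along the first dimension.
embeddings = [[10.0, 1.0], [10.0, -1.0], [11.0, 0.0], [9.0, 0.0]]
centered = batch_center(embeddings)

raw_similarity = cosine(embeddings[0], embeddings[1])
centered_similarity = cosine(centered[0], centered[1])
```

Before centering, the first two vectors look nearly identical even though they differ in the only dimension that distinguishes them. After centering, their similarity reflects that they point in opposite directions.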

* EMNLP 2020 Eval4NLP Workshop

Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Nov 21, 2019

Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim, Bohyung Han, Radu Soricut

Abstract:Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only used outcome of an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption ratings is several orders of magnitude less than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated by samples from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally on a different, multi-dimensional side-by-side human evaluation procedure.

* AAAI 2020

Quality Estimation for Image Captions Based on Large-scale Human Evaluations

Sep 08, 2019

Tomer Levinboim, Ashish Thapliyal, Piyush Sharma, Radu Soricut

Abstract:Automatic image captioning has improved significantly in the last few years, but the problem is far from being solved. Furthermore, while the standard automatic metrics, such as CIDEr and SPICE, can be used for model selection, they cannot be used at inference-time given a previously unseen image since they require ground-truth references. In this paper, we focus on the related problem called Quality Estimation (QE) of image-captions. In contrast to automatic metrics, QE attempts to model caption quality without relying on ground-truth references. It can thus be applied as a second-pass model (after caption generation) to estimate the quality of captions even for previously unseen images. We conduct a large-scale human evaluation experiment, in which we collect a new dataset of more than 600k ratings of image-caption pairs. Using this dataset, we design and experiment with several QE modeling approaches and provide an analysis of their performance. Our results show that QE is feasible for image captioning.

* 10 pages (8+2), 5 figures, 3 tables

Informative Image Captioning with External Sources of Information

Jun 20, 2019

Sanqiang Zhao, Piyush Sharma, Tomer Levinboim, Radu Soricut

Abstract:An image caption should fluently present the essential information in a given image, including informative, fine-grained entity mentions and the manner in which these entities interact. However, current captioning models are usually trained to generate captions that only contain common object names, thus falling short on an important "informativeness" dimension. We present a mechanism for integrating image information together with fine-grained labels (assumed to be generated by some upstream models) into a caption that describes the image in a fluent and informative manner. We introduce a multimodal, multi-encoder model based on Transformer that ingests both image features and multiple sources of entity labels. We demonstrate that we can learn to control the appearance of these entity labels in the output, resulting in captions that are both fluent and informative.
