Abstract: Bayesian Pseudo-Coreset (BPC) and Dataset Condensation are two parallel streams of work that construct a synthetic set such that a model trained solely on this synthetic set matches the performance of a model trained on the original training set. While dataset condensation methods use non-Bayesian, heuristic ways to construct such a synthetic set, BPC methods take a Bayesian approach and formulate the problem as divergence minimization between the posteriors associated with the original data and the synthetic data. However, BPC methods generally rely on distributional assumptions about these posteriors, which makes them less flexible and hinders their performance. In this work, we propose to solve these issues by modeling the posterior associated with the synthetic data as an energy-based distribution. We derive a contrastive-divergence-like loss function to learn the synthetic set and show a simple and efficient way to estimate this loss. Further, we perform rigorous experiments on the proposed method. Our experiments on multiple datasets show that the proposed method not only outperforms previous BPC methods but also achieves performance comparable to its dataset condensation counterparts.
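To make the contrastive-divergence idea concrete, below is a minimal NumPy sketch of a contrastive-divergence-like update with short-run Langevin sampling for the negative phase. This is a generic illustration, not the paper's method: the quadratic energy, the single scalar parameter `theta`, and all step sizes are hypothetical placeholders standing in for the energy-based posterior and the learnable synthetic set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative quadratic energy E_theta(x) = (x - theta)^2 / 2, so the
# model distribution p_theta(x) ∝ exp(-E_theta(x)) is N(theta, 1).
def energy_grad_x(x, theta):
    return x - theta          # ∂E/∂x, used by the Langevin sampler

def energy_grad_theta(x, theta):
    return -(x - theta)       # ∂E/∂theta, used by the CD update

def langevin_negatives(x0, theta, steps=20, step_size=0.1):
    """Short-run Langevin chain drawing 'negative' samples from p_theta."""
    x = x0.copy()
    for _ in range(steps):
        noise = rng.normal(size=x.shape)
        x = x - step_size * energy_grad_x(x, theta) + np.sqrt(2 * step_size) * noise
    return x

# Toy "real data": the CD-like update should pull theta toward its mean.
data = rng.normal(loc=2.0, scale=1.0, size=512)
theta, lr = 0.0, 0.5
for _ in range(100):
    neg = langevin_negatives(rng.normal(size=data.shape), theta)
    # Contrastive-divergence-like gradient: positive phase minus negative phase.
    grad = energy_grad_theta(data, theta).mean() - energy_grad_theta(neg, theta).mean()
    theta -= lr * grad
```

The two-term gradient (an expectation of the energy gradient under the data minus the same expectation under short-run model samples) is the standard structure of a contrastive-divergence estimate; in the paper this structure would apply to the synthetic set rather than a scalar parameter.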
Abstract: Recent work has demonstrated substantial gains from pre-training large-scale unidirectional language models such as GPT-2, GPT-3, and GPT-neo, followed by fine-tuning on a downstream task. In this paper, we evaluate the performance of the GPT-neo 1.3 billion parameter model on commonsense reasoning tasks. We assess the model on six commonsense reasoning benchmark tasks and report the accuracy scores for these tasks. When fine-tuned with the right set of hyperparameters, the model obtains competitive scores on three of these tasks but struggles when the dataset size is significantly smaller. The low performance on a few of these tasks suggests that these datasets are inherently difficult and that the model fails to establish coherent patterns from their limited training samples. We also investigate and substantiate our results using visualization and conduct numerous inference tests to better understand the model's behavior. Finally, we conduct thorough robustness tests using various methods to gauge the model's performance under numerous settings. These findings suggest a promising path for exploring language models smaller than the 175 billion parameter GPT-3 for tasks requiring natural language understanding.