Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ananth Gottumukkala

Evaluating NLP Models via Contrast Sets

Apr 06, 2020

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala(+16 more)

Figure 1 for Evaluating NLP Models via Contrast Sets

Figure 2 for Evaluating NLP Models via Contrast Sets

Figure 3 for Evaluating NLP Models via Contrast Sets

Figure 4 for Evaluating NLP Models via Contrast Sets

Abstract:Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets---up to 25\% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

Via

Access Paper or Ask Questions

ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension

Dec 29, 2019

Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, Matt Gardner

Figure 1 for ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension

Figure 2 for ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension

Figure 3 for ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension

Figure 4 for ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension

Abstract:Reading comprehension is one of the crucial tasks for furthering research in natural language understanding. A lot of diverse reading comprehension datasets have recently been introduced to study various phenomena in natural language, ranging from simple paraphrase matching and entity typing to entity tracking and understanding the implications of the context. Given the availability of many such datasets, comprehensive and reliable evaluation is tedious and time-consuming for researchers working on this problem. We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating testing a single model's capability in understanding a wide variety of reading phenomena. The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning for general reading facility. As more suitable datasets are released, they will be added to the evaluation server. We also collect and include synthetic augmentations for these datasets, testing how well models can handle out-of-domain questions.

Via

Access Paper or Ask Questions