Zhiyi Ma

Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

May 21, 2021

Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela

Abstract: We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows users to interact with uploaded models in real time to assess their quality, and permits the collection of additional metrics such as memory use, throughput, and robustness, which, despite their importance to practitioners, have traditionally been absent from leaderboards. On each task, models are ranked according to the Dynascore, a novel utility-based aggregation of these statistics, which users can customize to better reflect their preferences by placing more or less weight on a particular axis of evaluation or dataset. As state-of-the-art NLP models push the limits of traditional benchmarks, Dynaboard offers a standardized solution for a more diverse and comprehensive evaluation of model quality.
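The idea of ranking models by a user-weighted aggregate of several metrics can be illustrated with a minimal sketch. This is not the Dynascore formula from the paper, which the abstract does not spell out; it is a plain weighted sum of min-max normalized metrics, and the function names, metric names, and numbers below are all hypothetical.

```python
from typing import Dict, FrozenSet, List


def normalize(values: List[float], higher_is_better: bool = True) -> List[float]:
    """Min-max scale values to [0, 1]; flip the scale when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this axis, so give each the same full credit.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def weighted_scores(
    models: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    lower_is_better: FrozenSet[str] = frozenset(),
) -> Dict[str, float]:
    """Aggregate per-axis metrics into one score per model.

    Assumption: a simple weighted sum, used here only for illustration.
    Weights are rescaled to sum to 1 so that scores stay in [0, 1].
    """
    total = sum(weights.values())
    names = list(models)
    scores = {name: 0.0 for name in names}
    for axis, weight in weights.items():
        column = [models[name][axis] for name in names]
        normed = normalize(column, axis not in lower_is_better)
        for name, value in zip(names, normed):
            scores[name] += (weight / total) * value
    return scores


# Hypothetical metrics for two hypothetical models.
MODELS = {
    "model_a": {"accuracy": 0.90, "throughput": 100.0, "memory_gb": 8.0},
    "model_b": {"accuracy": 0.85, "throughput": 400.0, "memory_gb": 2.0},
}

if __name__ == "__main__":
    costs = frozenset({"memory_gb"})
    print(weighted_scores(MODELS, {"accuracy": 1.0}, costs))
    print(weighted_scores(
        MODELS, {"accuracy": 1.0, "throughput": 1.0, "memory_gb": 1.0}, costs
    ))
```

With all weight on accuracy, model_a ranks first; with equal weight on all three axes, model_b overtakes it. That reversal is the kind of preference-dependent ranking the abstract describes.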

Dynabench: Rethinking Benchmarking in NLP

Apr 07, 2021

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia(+9 more)

Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

* NAACL 2021

A Comparison of Approaches to Document-level Machine Translation

Jan 26, 2021

Zhiyi Ma, Sergey Edunov, Michael Auli

Abstract: Document-level machine translation conditions on surrounding sentences to produce coherent translations. There has been much recent work in this area with the introduction of custom model architectures and decoding algorithms. This paper presents a systematic comparison of selected approaches from the literature on two benchmarks for which document-level phenomena evaluation suites exist. We find that a simple method based purely on back-translating monolingual document-level data performs as well as much more elaborate alternatives, in terms of both document-level metrics and human evaluation.

* 10 pages, 5 tables

Beyond English-Centric Multilingual Machine Translation

Oct 21, 2020

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary(+7 more)

Abstract: Existing work in translation has demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-centric, training only on data translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open-source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models. Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.
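To give a sense of scale for "any pair of 100 languages", the short sketch below counts directed translation pairs. The function names are hypothetical, and the count is only an upper bound on possible directions; the abstract says the mined dataset covers "thousands" of directions with supervised data, not necessarily all of them.

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


def english_centric_pairs(n_languages: int) -> int:
    """Directions that have English on one side: into and out of English."""
    return 2 * (n_languages - 1)


if __name__ == "__main__":
    n = 100
    print(directed_pairs(n))         # all possible directions
    print(english_centric_pairs(n))  # directions an English-centric setup covers
```

For 100 languages this gives 9,900 possible directions, of which only 198 involve English, which is why training only on English-paired data leaves most directions without direct supervision.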
