Nitika Mathur

Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

Sep 15, 2024

Brian Thompson, Nitika Mathur, Daniel Deutsch, Huda Khayrallah

Abstract: Selecting an automatic metric that best emulates human judgments is often non-trivial, because there is no clear definition of "best emulates." A meta-metric is required to compare the human judgments to the automatic metric judgments, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric judgments. SPA allows for more fine-grained comparisons between systems than a simplistic binary win/loss, and addresses a number of shortcomings with PA: it is more stable with respect to both the number of systems and segments used for evaluation, it mitigates the issue of metric ties due to quantization, and it produces more statistically significant results. SPA was selected as the official system-level metric for the 2024 WMT metric shared task.
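
The contrast between a binary win/loss and a significance-aware comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the p-values are assumed to come from some per-pair significance test that is not shown here, and the specific soft formula (one minus the absolute difference of the two p-values, averaged over system pairs) is an assumption about how significance could be folded in.

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs that the metric orders the same way as humans.

    human, metric: lists of system-level scores, same system order.
    Ties in either list count as disagreement in this sketch.
    """
    pairs = list(combinations(range(len(human)), 2))
    agree = sum(
        1
        for i, j in pairs
        if (human[i] - human[j]) * (metric[i] - metric[j]) > 0
    )
    return agree / len(pairs)


def soft_pairwise_accuracy(p_human, p_metric):
    """Assumed soft variant: average of 1 - |p_human - p_metric| over pairs.

    p_human, p_metric: dicts mapping a system pair (i, j) to a p-value for
    "system i is better than system j". How those p-values are obtained is
    outside this sketch.
    """
    total = sum(1 - abs(p_human[pair] - p_metric[pair]) for pair in p_human)
    return total / len(p_human)


# Toy scores for three systems (made-up numbers, for illustration only).
human_scores = [0.9, 0.5, 0.1]
metric_scores = [30.0, 20.0, 25.0]
pa = pairwise_accuracy(human_scores, metric_scores)

# Toy p-values for the same three pairs (also made up).
p_h = {(0, 1): 0.01, (0, 2): 0.0, (1, 2): 0.2}
p_m = {(0, 1): 0.05, (0, 2): 0.0, (1, 2): 0.9}
spa = soft_pairwise_accuracy(p_h, p_m)
```

In the toy data the metric disagrees with humans on one of three pairs, so the binary score drops by a full third, while the soft score penalizes that pair only in proportion to how far apart the two p-values are.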

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Jun 12, 2020

Nitika Mathur, Timothy Baldwin, Trevor Cohn

Abstract: Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

* Accepted at ACL 2020
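
The type I / type II trade-off described in the abstract above can be illustrated with a small sketch. It assumes each system pair is summarized by a non-negative metric improvement and a flag saying whether humans found the difference significant; the function name, the data layout, and the numbers are hypothetical and are not taken from the paper.

```python
def threshold_errors(pairs, threshold):
    """Count type I and type II errors for one metric-difference threshold.

    pairs: list of (metric_delta, human_significant) tuples, one per system
    pair, where metric_delta >= 0 is the improvement under the automatic
    metric and human_significant says whether humans judged the difference
    to be significant.

    Type I:  the metric difference is accepted (delta >= threshold) although
             the human difference is insignificant.
    Type II: the metric difference is rejected (delta < threshold) although
             the human difference is significant.
    """
    type_i = sum(1 for delta, sig in pairs if delta >= threshold and not sig)
    type_ii = sum(1 for delta, sig in pairs if delta < threshold and sig)
    return type_i, type_ii


# Made-up system pairs, for illustration only.
toy_pairs = [(0.5, False), (1.2, True), (2.0, False), (3.1, True), (0.8, True)]

# Raising the threshold trades type I errors for type II errors.
errors_by_threshold = {t: threshold_errors(toy_pairs, t) for t in (0.0, 1.0, 2.5)}
```

Sweeping the threshold in this way shows the trade-off directly: a low threshold accepts every metric difference and incurs only type I errors, while a high threshold rejects most differences and incurs only type II errors.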
