Lorenzo Proietti

PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation

Jan 25, 2026

Lorenzo Proietti, Roman Grundkiewicz, Matt Post

Abstract: We present PEAR (Pairwise Evaluation for Automatic Relative Scoring), a supervised Quality Estimation (QE) metric family that reframes reference-free Machine Translation (MT) evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. The metrics are trained using pairwise supervision derived from differences in human judgments, with an additional regularization term that encourages sign inversion under candidate order reversal. On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the proposed pairwise formulation. Despite using substantially fewer parameters than recent large metrics, PEAR surpasses far larger QE models and reference-based metrics. Our analysis further indicates that PEAR yields a less redundant evaluation signal relative to other top metrics. Finally, we show that PEAR is an effective utility function for Minimum Bayes Risk (MBR) decoding, reducing pairwise scoring cost at negligible impact.

* 18 pages
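
The training objective described in the abstract (pairwise supervision from differences in human judgments, plus a regularizer that encourages sign inversion when the candidates are swapped) might look roughly like the sketch below. The abstract does not give the exact loss, so the squared-error form, the function name `pairwise_loss`, and the `reg_weight` parameter are all assumptions made for illustration, not the authors' implementation.

```python
def pairwise_loss(pred_ab, pred_ba, human_a, human_b, reg_weight=0.5):
    """Illustrative loss for one (source, candidate A, candidate B) triple.

    pred_ab: model output for the ordered pair (A, B)
    pred_ba: model output for the reversed pair (B, A)
    human_a, human_b: human quality scores for A and B
    reg_weight: hypothetical weight of the order-reversal regularizer
    """
    # Target: direction and magnitude of the human quality difference.
    target = human_a - human_b
    # Fit term (assumed squared error): match the graded difference.
    fit = (pred_ab - target) ** 2
    # Regularizer: an order-consistent model satisfies pred_ba == -pred_ab,
    # so the sum of the two predictions is pushed toward zero.
    antisymmetry = (pred_ab + pred_ba) ** 2
    return fit + reg_weight * antisymmetry
```

Under these assumptions, a prediction that matches the human difference and flips sign on reversal incurs almost no loss, while a prediction that keeps the same sign after the swap is penalized by the regularizer alone.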

Estimating Machine Translation Difficulty

Aug 13, 2025

Lorenzo Proietti, Stefano Perrella, Vilém Zouhar, Roberto Navigli, Tom Kocmi

Abstract: Machine translation has begun to achieve near-perfect quality in some setups. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. Automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. We formalize the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging machine translation benchmarks. Our results show that dedicated models (dubbed Sentinel-src) outperform both heuristic-based methods (e.g., word rarity or syntactic complexity) and LLM-as-a-judge approaches. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.

Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress

Jun 24, 2025

Lorenzo Proietti, Stefano Perrella, Roberto Navigli

Abstract: In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to measure progress in the field, fostering discussion on an issue that we believe is crucial to the entire MT evaluation community.

* Accepted at ACL 2025 Main Conference. 24 pages

Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics

Oct 07, 2024

Stefano Perrella, Lorenzo Proietti, Pere-Lluís Huguet Cabot, Edoardo Barba, Roberto Navigli

Abstract: Machine Translation (MT) evaluation metrics assess translation quality automatically. Recently, researchers have employed MT metrics for various new use cases, such as data filtering and translation re-ranking. However, most MT metrics return assessments as scalar scores that are difficult to interpret, posing a challenge to making informed design choices. Moreover, MT metrics' capabilities have historically been evaluated using correlation with human judgment, which, despite its efficacy, falls short of providing intuitive insights into metric performance, especially in terms of new metric use cases. To address these issues, we introduce an interpretable evaluation framework for MT metrics. Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases. Furthermore, by measuring the performance of MT metrics using Precision, Recall, and F-score, we offer clearer insights into their capabilities than correlation with human judgments. Finally, we raise concerns regarding the reliability of manually curated data following the Direct Assessments + Scalar Quality Metrics (DA+SQM) guidelines, reporting a notably low agreement with Multidimensional Quality Metrics (MQM) annotations.

* Accepted at EMNLP 2024 Main Conference. 26 pages
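
As a rough illustration of the data filtering scenario mentioned in the abstract, a metric can be treated as a binary classifier: translations scoring at or above a threshold are kept, and the kept set is compared against human good/bad labels using Precision, Recall, and F-score. The function name, the thresholding rule, and the toy inputs below are assumptions for illustration; the paper's actual protocol may differ.

```python
def precision_recall_f1(metric_scores, human_is_good, threshold):
    """Score a metric used as a filter (hypothetical setup).

    metric_scores: metric score for each translation
    human_is_good: True where human annotators judged the translation good
    threshold: translations with score >= threshold are kept
    """
    kept = [score >= threshold for score in metric_scores]
    pairs = list(zip(kept, human_is_good))
    true_pos = sum(1 for k, g in pairs if k and g)        # kept and good
    false_pos = sum(1 for k, g in pairs if k and not g)   # kept but bad
    false_neg = sum(1 for k, g in pairs if not k and g)   # discarded but good
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: four translations, two of which humans judged good.
print(precision_recall_f1([0.9, 0.8, 0.4, 0.2], [True, False, True, False], 0.5))
# prints (0.5, 0.5, 0.5)
```

Read this way, a precision of 0.5 says directly that half of the translations the filter kept were actually good, which is easier to act on than a correlation coefficient.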

Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!

Aug 25, 2024

Stefano Perrella, Lorenzo Proietti, Alessandro Scirè, Edoardo Barba, Roberto Navigli

Abstract: Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results guide researchers toward enhancing the next generation of metrics and MT systems. With the recent introduction of neural metrics, the field has witnessed notable advancements. Nevertheless, the inherent opacity of these metrics has posed substantial challenges to the meta-evaluation process. This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings. To do this, we introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness. By employing sentinel metrics, we aim to validate our findings, and shed light on and monitor the potential biases or inconsistencies in the rankings. We discover that the present meta-evaluation framework favors two categories of metrics: i) those explicitly trained to mimic human quality assessments, and ii) continuous metrics. Finally, we raise concerns regarding the evaluation capabilities of state-of-the-art metrics, emphasizing that they might be basing their assessments on spurious correlations found in their training data.

* Presented at ACL 2024 Main Conference. 29 pages
