Christian Federmann

Preliminary WMT24 Ranking of General MT Systems and LLMs

Jul 29, 2024

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz (+11 more)

Abstract: This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any findings but only to provide preliminary results to the participants of the General MT task that may be useful during the writing of the system submission.

Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies

Jan 12, 2024

Tom Kocmi, Vilém Zouhar, Christian Federmann, Matt Post

Abstract: Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the "dynamic range" of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask what point difference X in metric Y is required between two systems for humans to notice? We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values with regard to test set size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.
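One simple reading of the delta-accuracy idea above can be sketched in Python: keep only the system pairs whose metric difference is at least some threshold, then measure how often the metric and the human scores agree on the better system. This is an illustrative sketch, not the paper's procedure; the function name `accuracy_at_delta`, the thresholding scheme, and the toy scores are all assumptions.

```python
from itertools import combinations


def accuracy_at_delta(metric_scores, human_scores, min_delta):
    """Pairwise accuracy restricted to pairs with |metric delta| >= min_delta.

    Both arguments map a system name to its system-level score.
    Returns None when no pair reaches the threshold.
    """
    # Keep only system pairs whose metric difference is large enough.
    pairs = [
        (a, b)
        for a, b in combinations(sorted(metric_scores), 2)
        if abs(metric_scores[a] - metric_scores[b]) >= min_delta
    ]
    if not pairs:
        return None
    # A pair counts as correct when both deltas have the same sign
    # (ties are counted as disagreements; this is an assumption).
    agree = sum(
        (metric_scores[a] - metric_scores[b]) * (human_scores[a] - human_scores[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical toy scores, not taken from any dataset.
toy_metric = {"sysA": 30.1, "sysB": 28.4, "sysC": 31.0}
toy_human = {"sysA": 72.0, "sysB": 70.5, "sysC": 71.0}

for threshold in (0.0, 1.0, 5.0):
    print(threshold, accuracy_at_delta(toy_metric, toy_human, threshold))
```

Sweeping the threshold shows how agreement with humans changes as the required metric difference grows, which is the kind of relationship the abstract refers to.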

GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4

Oct 21, 2023

Tom Kocmi, Christian Federmann

Abstract: This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Based on the power of large language models (LLMs), GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans. Compared to previous works, our method has language-agnostic prompts, thus avoiding the need for manual prompt preparation for new languages. While preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking, we advise caution when using it in academic works to demonstrate improvements over other methods due to its dependence on the proprietary, black-box GPT model.

* Accepted to WMT 2023

Large Language Models Are State-of-the-Art Evaluators of Translation Quality

Feb 28, 2023

Tom Kocmi, Christian Federmann

Abstract: We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Compared to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

* 10 pages, 8 tables, one figure

The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task

Sep 17, 2021

Shuoyang Ding, Marcin Junczys-Dowmunt, Matt Post, Christian Federmann, Philipp Koehn

Abstract: This paper presents the JHU-Microsoft joint submission for the WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip translation, and pseudo post-editing of the MT output. We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline. Our system is also the top-ranking system on the MT MCC metric for the English-German language pair.

* 7 Pages, Accepted to WMT21 (System Description)

To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

Jul 22, 2021

Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, Arul Menezes

Abstract: Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluation of metric correlations has been limited to a small collection of human judgements. In this paper, we corroborate how reliable metrics are in contrast to human judgements on, to the best of our knowledge, the largest collection of human judgements. We investigate which metrics have the highest accuracy in making system-level quality rankings for pairs of systems, taking human judgement as a gold standard, which is the closest scenario to the real metric usage. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU negatively affected the past development of improved models. We release the collection of human judgements of 4380 systems and 2.3M annotated sentences for further analysis and replication of our work.
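The system-level pairwise accuracy mentioned in this abstract can be illustrated with a minimal Python sketch: for every pair of systems, check whether the metric and the human judgements rank the two systems in the same order, then report the fraction of agreeing pairs. The function name `pairwise_accuracy`, the handling of ties (counted as disagreements), and the toy scores are assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations


def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs where metric and human deltas share a sign.

    Both arguments map a system name to its system-level score.
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(metric_scores), 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        total += 1
        # Same sign means both rank the pair in the same order.
        # Ties (a zero delta) count as disagreement: an assumption.
        if metric_delta * human_delta > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical toy scores, not taken from the released data.
metric = {"sysA": 30.1, "sysB": 28.4, "sysC": 31.0}
human = {"sysA": 72.0, "sysB": 70.5, "sysC": 71.0}

print(pairwise_accuracy(metric, human))
```

In this toy example the metric agrees with the human ranking on two of the three system pairs (it disagrees on sysA versus sysC), so the accuracy is 2/3.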

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs

Apr 21, 2021

Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann, Tom Kocmi

Abstract: Recent studies emphasize the need for document context in human evaluation of machine translations, but little research has been done on the impact of user interfaces on annotator productivity and the reliability of assessments. In this work, we compare human assessment data from the last two WMT evaluation campaigns collected via two different methods for document-level evaluation. Our analysis shows that a document-centric approach to evaluation, where the annotator is presented with the entire document context on a screen, leads to higher-quality segment- and document-level assessments. It improves the correlation between segment and document scores and increases inter-annotator agreement for document scores but is considerably more time-consuming for annotators.

* Presented at HumEval, EACL 2021

Achieving Human Parity on Automatic Chinese to English News Translation

Jun 29, 2018

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li (+14 more)

Abstract: Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human parity in translation. We then describe Microsoft's machine translation system and measure the quality of its translations on the widely used WMT 2017 news translation task from Chinese to English. We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations. We also find that it significantly exceeds the quality of crowd-sourced non-professional translations.
