Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Maria Nadejde

M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Jun 12, 2024

Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari

Figure 1 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Figure 2 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Figure 3 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Figure 4 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Abstract:Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications.

* NAACL 2024, dataset at https://github.com/amazon-science/m3t-multi-modal-translation-bench

Via

Access Paper or Ask Questions

RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation

May 26, 2023

Gabriele Sarti, Phu Mon Htut, Xing Niu, Benjamin Hsu, Anna Currey, Georgiana Dinu, Maria Nadejde

Figure 1 for RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation

Figure 2 for RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation

Figure 3 for RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation

Figure 4 for RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation

Abstract:Attribute-controlled translation (ACT) is a subtask of machine translation that involves controlling stylistic or linguistic attributes (like formality and gender) of translation outputs. While ACT has garnered attention in recent years due to its usefulness in real-world applications, progress in the task is currently limited by dataset availability, since most prior approaches rely on supervised methods. To address this limitation, we propose Retrieval and Attribute-Marking enhanced Prompting (RAMP), which leverages large multilingual language models to perform ACT in few-shot and zero-shot settings. RAMP improves generation accuracy over the standard prompting approach by (1) incorporating a semantic similarity retrieval component for selecting similar in-context examples, and (2) marking in-context examples with attribute annotations. Our comprehensive experiments show that RAMP is a viable approach in both zero-shot and few-shot settings.

* Accepted at ACL 2023

Via

Access Paper or Ask Questions

Sockeye 3: Fast Neural Machine Translation with PyTorch

Jul 12, 2022

Felix Hieber, Michael Denkowski, Tobias Domhan, Barbara Darques Barros, Celina Dong Ye, Xing Niu, Cuong Hoang, Ke Tran, Benjamin Hsu, Maria Nadejde(+4 more)

Figure 1 for Sockeye 3: Fast Neural Machine Translation with PyTorch

Figure 2 for Sockeye 3: Fast Neural Machine Translation with PyTorch

Figure 3 for Sockeye 3: Fast Neural Machine Translation with PyTorch

Figure 4 for Sockeye 3: Fast Neural Machine Translation with PyTorch

Abstract:Sockeye 3 is the latest version of the Sockeye toolkit for Neural Machine Translation (NMT). Now based on PyTorch, Sockeye 3 provides faster model implementations and more advanced features with a further streamlined codebase. This enables broader experimentation with faster iteration, efficient training of stronger and faster models, and the flexibility to move new ideas quickly from research to production. When running comparable models, Sockeye 3 is up to 126% faster than other PyTorch implementations on GPUs and up to 292% faster on CPUs. Sockeye 3 is open source software released under the Apache 2.0 license.

Via

Access Paper or Ask Questions

Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1

Jun 04, 2020

Maria Nadejde, Joel Tetreault

Figure 1 for Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1

Figure 2 for Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1

Figure 3 for Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1

Figure 4 for Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1

Abstract:Grammar error correction (GEC) systems have become ubiquitous in a variety of software applications, and have started to approach human-level performance for some datasets. However, very little is known about how to efficiently personalize these systems to the user's characteristics, such as their proficiency level and first language, or to emerging domains of text. We present the first results on adapting a general-purpose neural GEC system to both the proficiency level and the first language of a writer, using only a few thousand annotated sentences. Our study is the broadest of its kind, covering five proficiency levels and twelve different languages, and comparing three different adaptation scenarios: adapting to the proficiency level only, to the first language only, or to both aspects simultaneously. We show that tailoring to both scenarios achieves the largest performance improvement (3.6 F0.5) relative to a strong baseline.

* Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text, pages 27-33, Hong Kong, Nov 4, 2019
* Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text

Via

Access Paper or Ask Questions

Predicting Target Language CCG Supertags Improves Neural Machine Translation

Jul 18, 2017

Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, Alexandra Birch

Figure 1 for Predicting Target Language CCG Supertags Improves Neural Machine Translation

Figure 2 for Predicting Target Language CCG Supertags Improves Neural Machine Translation

Figure 3 for Predicting Target Language CCG Supertags Improves Neural Machine Translation

Figure 4 for Predicting Target Language CCG Supertags Improves Neural Machine Translation

Abstract:Neural machine translation (NMT) models are able to partially learn syntactic information from sequential lexical information. Still, some complex syntactic phenomena such as prepositional phrase attachment are poorly modeled. This work aims to answer two questions: 1) Does explicitly modeling target language syntax help NMT? 2) Is tight integration of words and syntax better than multitask training? We introduce syntactic information in the form of CCG supertags in the decoder, by interleaving the target supertags with the word sequence. Our results on WMT data show that explicitly modeling target-syntax improves machine translation quality for German->English, a high-resource pair, and for Romanian->English, a low-resource pair and also several syntactic phenomena including prepositional phrase attachment. Furthermore, a tight coupling of words and syntax improves translation quality more than multitask training. By combining target-syntax with adding source-side dependency labels in the embedding layer, we obtain a total improvement of 0.9 BLEU for German->English and 1.2 BLEU for Romanian->English.

* Accepted at the Second Conference on Machine Translation (WMT17). This version includes more results regarding target syntax for Romanian->English and reports fewer results regarding source syntax

Via

Access Paper or Ask Questions