Masaaki Nagata

Preliminary WMT24 Ranking of General MT Systems and LLMs

Jul 29, 2024

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz (+11 more)

Abstract: This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any findings but only to provide preliminary results to the participants of the General MT task that may be useful during the writing of the system submission.

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Jul 03, 2024

Minato Kondo, Takehito Utsuro, Masaaki Nagata

Abstract: In this paper, we propose a two-phase training approach in which pre-trained large language models are continually pre-trained on parallel data and then supervised fine-tuned with a small amount of high-quality parallel data. To investigate the effectiveness of our proposed approach, we conducted continual pre-training with a 3.8B-parameter model and parallel data across eight different formats. We evaluate these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that when utilizing parallel data in continual pre-training, it is essential to alternate between source and target sentences. Additionally, we demonstrate that the translation accuracy improves only for translation directions where the order of source and target sentences aligns between continual pre-training data and inference. We further demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data compared to supervised encoder-decoder models. We also show that the highest accuracy is achieved when the data for continual pre-training consists of interleaved source and target sentences and when tags are added to the source sentences.
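
As a rough illustration of what "interleaved source and target sentences" with source-side tags might look like, here is a minimal Python sketch. The abstract does not specify the tag strings or the eight data formats, so the function name `interleave_pairs`, the `<ja>` tag, and the `tgt_first` switch are all assumptions made for illustration, not the authors' actual format.

```python
def interleave_pairs(pairs, src_tag="<src>", tgt_first=False):
    """Build one training text by alternating source and target sentences.

    pairs: iterable of (source_sentence, target_sentence) tuples.
    src_tag: hypothetical tag prepended to each source sentence.
    tgt_first: if True, emit the target before the tagged source.
    """
    lines = []
    for src, tgt in pairs:
        tagged_src = f"{src_tag} {src}"
        # The order chosen here should match the order used at inference,
        # since the abstract reports gains only when the two agree.
        lines.extend([tgt, tagged_src] if tgt_first else [tagged_src, tgt])
    return "\n".join(lines)


corpus = [("こんにちは", "Hello"), ("ありがとう", "Thank you")]
print(interleave_pairs(corpus, src_tag="<ja>"))
```

Keeping the order as a single flag makes it easy to generate data for one translation direction at a time, which is the property the abstract says matters.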

* IWSLT2024, 18 pages

Word Alignment as Preference for Machine Translation

May 15, 2024

Qiyu Wu, Masaaki Nagata, Zhongtao Miao, Yoshimasa Tsuruoka

Abstract: The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose to utilize word alignment as preference to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate the performance of the models in mitigating these issues. We verify the rationality of these designed evaluation methods by experiments, followed by extensive results demonstrating the effectiveness of word alignment-based preference optimization to mitigate hallucination and omission.

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

May 15, 2024

Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda

Abstract: Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.

* Work in progress

WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction

Jun 09, 2023

Qiyu Wu, Masaaki Nagata, Yoshimasa Tsuruoka

Abstract: Most existing word alignment methods rely on manual alignment datasets or parallel corpora, which limits their usefulness. Here, to mitigate the dependence on manual data, we broaden the source of supervision by relaxing the requirement for correct, fully-aligned, and parallel sentences. Specifically, we make noisy, partially aligned, and non-parallel paragraphs. We then use such a large-scale weakly-supervised dataset for word alignment pre-training via span prediction. Extensive experiments with various settings empirically demonstrate that our approach, which is named WSPAlign, is an effective and scalable way to pre-train word aligners without manual data. When fine-tuned on standard benchmarks, WSPAlign sets a new state of the art by improving upon the best supervised baseline by 3.3 to 6.1 points in F1 and 1.5 to 6.1 points in AER. Furthermore, WSPAlign achieves competitive performance compared with the corresponding baselines in few-shot, zero-shot and cross-lingual tests, which demonstrates that WSPAlign is potentially more practical for low-resource languages than existing methods.
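
For readers unfamiliar with the two metrics quoted above, the following Python sketch computes F1 and alignment error rate (AER) from the standard definitions based on sure and possible gold links. The function name `alignment_scores` and the toy links are illustrative assumptions; this is not the evaluation script used for WSPAlign.

```python
def alignment_scores(predicted, sure, possible=()):
    """Return (precision, recall, f1, aer) for one set of alignment links.

    Each link is a (source_index, target_index) tuple.
    Sure links are treated as a subset of possible links, as is conventional.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s
    if not a or not s:
        raise ValueError("predicted and sure link sets must be non-empty")
    precision = len(a & p) / len(a)   # predicted links that are at least possible
    recall = len(a & s) / len(s)      # sure links that were recovered
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); lower is better.
    aer = 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, f1, aer


pred = {(0, 0), (1, 1), (2, 3)}
gold_sure = {(0, 0), (1, 1)}
gold_possible = {(2, 2)}
print(alignment_scores(pred, gold_sure, gold_possible))
```

Note that F1 rises and AER falls as alignments improve, so the reported gains in the two metrics point in the same direction.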

* To appear at ACL 2023

Domain Adaptation of Machine Translation with Crowdworkers

Oct 28, 2022

Makoto Morishita, Jun Suzuki, Masaaki Nagata

Abstract: Although a machine translation model trained with a large in-domain parallel corpus achieves remarkable results, it still works poorly when no in-domain data are available. This situation restricts the applicability of machine translation when the target domain's data are limited. However, there is great demand for high-quality domain-specific machine translation models for many domains. We propose a framework that efficiently and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers. With the collected parallel data, we can quickly adapt a machine translation model to the target domain. Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost. We tested it with five domains, and the domain-adapted model improved BLEU scores by up to +19.7 points, and by +7.8 points on average, compared to a general-purpose translation model.

* Accepted at EMNLP 2022 Industry Track

A Simple and Strong Baseline for End-to-End Neural RST-style Discourse Parsing

Oct 15, 2022

Naoki Kobayashi, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, Masaaki Nagata

Abstract: To promote and further develop RST-style discourse parsing models, we need a strong baseline that can be regarded as a reference for reporting reliable experimental results. This paper explores a strong baseline by integrating existing simple parsing strategies, top-down and bottom-up, with various transformer-based pre-trained language models. The experimental results obtained from two benchmark datasets demonstrate that the parsing performance strongly relies on the pre-trained language models rather than the parsing strategies. In particular, the bottom-up parser achieves large performance gains compared to the current best parser when employing DeBERTa. Our analysis of intra- and multi-sentential parsing and of nuclearity prediction further reveals that language models with a span-masking scheme especially boost parsing performance.

* Accepted in Findings of EMNLP 2022

Extending Word-Level Quality Estimation for Post-Editing Assistance

Sep 23, 2022

Yizhen Wei, Takehito Utsuro, Masaaki Nagata

Abstract: We define a novel concept called extended word alignment in order to improve the efficiency of post-editing assistance. Based on extended word alignment, we further propose a novel task called refined word-level QE that outputs refined tags and word-level correspondences. Compared to original word-level QE, the new task can directly point out editing operations, thus improving efficiency. To extract extended word alignment, we adopt a supervised method based on mBERT. To solve refined word-level QE, we first predict original QE tags by training a regression model for sequence tagging based on mBERT and XLM-R. Then, we refine the original word tags with extended word alignment. In addition, we extract source-gap correspondences and, at the same time, obtain gap tags. Experiments on two language pairs show the feasibility of our method and suggest directions for further improvement.

JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus

Feb 28, 2022

Makoto Morishita, Katsuki Chousa, Jun Suzuki, Masaaki Nagata

Abstract: Most current machine translation models are mainly trained with parallel corpora, and their translation accuracy largely depends on the quality and quantity of the corpora. Although there are billions of parallel sentences for a few language pairs, effectively dealing with most language pairs is difficult due to a lack of publicly available parallel corpora. This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available, compared to such resource-rich languages as English-German. It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0. Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus. Through experiments, we empirically show how our new corpus boosts the accuracy of machine translation models on various domains. The JParaCrawl v3.0 corpus will eventually be publicly available online for research purposes.

* 7 pages

Bilingual Text Extraction as Reading Comprehension

Apr 29, 2020

Katsuki Chousa, Masaaki Nagata, Masaaki Nishino

Abstract: In this paper, we propose a method to extract bilingual texts automatically from noisy parallel corpora by framing the problem as a token-level span prediction, such as SQuAD-style Reading Comprehension. To extract a span of the target document that is a translation of a given source sentence (span), we use either QANet or multilingual BERT. QANet can be trained for a specific parallel corpus from scratch, while multilingual BERT can utilize pre-trained multilingual representations. For the span prediction method using QANet, we introduce a total optimization method using integer linear programming to achieve consistency in the predicted parallel spans. We conduct a parallel sentence extraction experiment using simulated noisy parallel corpora with two language pairs (En-Fr and En-Ja) and find that the proposed method using QANet achieves significantly better accuracy than a baseline method using two bi-directional RNN encoders, particularly for distant language pairs (En-Ja). We also conduct a sentence alignment experiment using En-Ja newspaper articles and find that the proposed method using multilingual BERT achieves significantly better accuracy than a baseline method using a bilingual dictionary and dynamic programming.

* 7 pages
