Title: LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation

Jun 03, 2024

Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, Yue Zhang

Abstract: The fine-tuning of open-source large language models (LLMs) for machine translation has recently received considerable attention, marking a shift towards data-centric research from traditional neural machine translation. However, the area of data collection for instruction fine-tuning in machine translation remains relatively underexplored. In this paper, we present LexMatcher, a simple yet effective method for data collection that leverages bilingual dictionaries to generate a dataset, the design of which is driven by the coverage of senses found in these dictionaries. The dataset comprises a subset retrieved from an existing corpus and a smaller synthesized subset which supplements the infrequent senses of polysemous words. Utilizing LLaMA2 as our base model, our approach outperforms the established baselines on the WMT2022 test sets and also exhibits significant performance improvements in tasks related to word sense disambiguation and specialized terminology translation. These results underscore the effectiveness of LexMatcher in enhancing LLM-based machine translation.
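To make the idea of sense-coverage-driven retrieval concrete, here is a minimal sketch. It is not the authors' implementation: the function name `select_by_sense_coverage`, the `max_per_sense` threshold, whitespace tokenization, and the toy English-French dictionary are all assumptions made for illustration. A "sense" is approximated as a (source word, target translation) pair from the dictionary, and a sentence pair is kept only if it contains at least one sense that has not yet been seen often enough. Senses that the corpus never covers are returned separately, as candidates for the synthesized subset the abstract mentions.

```python
from collections import Counter


def select_by_sense_coverage(corpus, dictionary, max_per_sense=2):
    """Hypothetical sketch of dictionary-driven retrieval.

    corpus: iterable of (source_sentence, target_sentence) pairs.
    dictionary: maps a source word to a set of target translations;
        each (word, translation) pair is treated as one sense.
    max_per_sense: assumed cap on how often a sense must be seen
        before it stops justifying the selection of new pairs.
    """
    counts = Counter()
    selected = []
    for src, tgt in corpus:
        # Naive whitespace tokenization (an assumption; real data
        # would need proper tokenization and lemmatization).
        src_tokens = set(src.lower().split())
        tgt_tokens = set(tgt.lower().split())
        # A sense is matched when the word is in the source sentence
        # and one of its translations is in the target sentence.
        matched = [
            (word, trans)
            for word in src_tokens
            if word in dictionary
            for trans in dictionary[word]
            if trans in tgt_tokens
        ]
        # Keep the pair if it contributes an under-covered sense.
        if any(counts[sense] < max_per_sense for sense in matched):
            selected.append((src, tgt))
            counts.update(matched)
    # Senses never matched in the corpus: candidates for synthesis.
    all_senses = {(w, t) for w, ts in dictionary.items() for t in ts}
    uncovered = sorted(all_senses - set(counts))
    return selected, uncovered


# Toy data, invented for this example.
toy_dictionary = {
    "bank": {"banque", "rive"},
    "spring": {"printemps", "ressort"},
}
toy_corpus = [
    ("The bank is closed", "La banque est fermée"),
    ("My bank called", "Ma banque a appelé"),
    ("The bank opened", "La banque a ouvert"),
    ("We sat on the river bank", "Nous étions assis sur la rive"),
    ("Spring is here", "Le printemps est là"),
]

selected, uncovered = select_by_sense_coverage(toy_corpus, toy_dictionary)
print(len(selected))  # 4: the third "banque" pair is skipped
print(uncovered)      # [('spring', 'ressort')]
```

In this toy run the frequent sense ("bank" as "banque") stops pulling in new pairs once the cap is reached, while the rarer sense ("bank" as "rive") is still retrieved, and the sense with no corpus match at all is flagged for synthesis.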
