Title: How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise on Machine Translation

Jul 02, 2024

Yan Meng, Di Wu, Christof Monz

Abstract: Web-mined parallel data is available in massive quantities but contains a large amount of noise. Semantic misalignment, the primary source of this noise, poses a challenge for training machine translation systems. In this paper, we first study the impact of real-world, hard-to-detect misalignment noise by proposing a process that simulates realistic misalignment controlled by semantic similarity. After quantitatively analyzing the impact of simulated misalignment on machine translation, we show that widely used pre-filters have limited effectiveness in improving translation performance, underscoring the need for more fine-grained ways to handle data noise. Observing that the model's self-knowledge becomes increasingly reliable at distinguishing misaligned from clean data at the token level, we propose a self-correction approach that leverages the model's prediction distribution to revise the training supervision from the ground-truth data over the course of training. Through comprehensive experiments, we show that our self-correction method not only improves translation performance in the presence of simulated misalignment noise but also proves effective on real-world noisy web-mined datasets across eight translation tasks.
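The abstract does not give the exact formulation of the self-correction step, so the following is only a minimal sketch of one way the idea could look at the token level: the one-hot ground-truth target is mixed with the model's own prediction distribution, and the mixing weight grows as training progresses. The linear schedule, the `max_weight` cap, and all function names (`trust_weight`, `corrected_target`, `cross_entropy`) are assumptions made for illustration, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trust_weight(step, total_steps, max_weight=0.5):
    """Hypothetical schedule (an assumption, not from the paper):
    trust in the model's predictions grows linearly from 0 to max_weight."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * frac


def corrected_target(logits, gold_index, weight):
    """Mix the one-hot ground-truth target with the model's prediction
    distribution. weight = 0 keeps the original supervision unchanged."""
    probs = softmax(logits)
    return [
        (1.0 - weight) * (1.0 if i == gold_index else 0.0) + weight * p
        for i, p in enumerate(probs)
    ]


def cross_entropy(logits, target):
    """Cross-entropy between a (possibly soft) target and the model's softmax."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the reference token (index 1) may be
    # misaligned, while the model prefers index 0.
    logits = [2.0, 0.5, -1.0]
    gold = 1
    for step in (0, 50, 100):
        w = trust_weight(step, total_steps=100)
        target = corrected_target(logits, gold, w)
        print(step, w, target, cross_entropy(logits, target))
```

In this sketch, early training relies entirely on the reference token, while later steps shift part of the supervision toward tokens the model itself considers likely, which softens the penalty for disagreeing with a possibly misaligned reference.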
