Abstract: Monolingual data, which is readily available in large quantities, has been used to augment the scarce parallel data in order to train better models for automatic translation. Self-learning, in which a model learns from its own output, is one approach to exploiting such data. However, it has been shown that too much of this data can be detrimental to the model's performance when the available parallel data is comparatively very limited. In this study, we investigate whether the monolingual data can also be too little and whether reducing it, based on quality, has any effect on the performance of the translation model. Experiments on English-German low-resource NMT show that it is often better to select only the most useful additional data, based on quality or closeness to the domain of the test data, than to utilize all of the available data.
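As a rough illustration of the kind of domain-based selection referred to above, the sketch below ranks a monolingual pool by a cross-entropy-difference criterion computed with simple unigram language models, keeping only the sentences that look closest to the in-domain data. The corpora, the unigram models, and the retained fraction are hypothetical placeholders for illustration, not the actual selection procedure used in the paper.

```python
import math
from collections import Counter

def unigram_logprob_model(sentences):
    """Build a simple add-one-smoothed unigram log-probability model."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts)
    return lambda tok: math.log((counts.get(tok, 0) + 1) / (total + vocab))

def score(sentence, in_domain_lp, general_lp):
    """Cross-entropy difference per token: lower means closer to the in-domain data."""
    toks = sentence.split()
    if not toks:
        return float("inf")
    return sum(general_lp(t) - in_domain_lp(t) for t in toks) / len(toks)

# Hypothetical corpora: a small in-domain sample, a general-domain sample,
# and the larger monolingual pool to be filtered before self-learning.
in_domain = ["the patient was given medication", "symptoms improved after treatment"]
general = ["the match ended in a draw", "stocks fell sharply on monday"]
monolingual_pool = [
    "the patient responded well to treatment",
    "the striker scored twice in the final",
    "doctors recommended further tests",
]

in_lp = unigram_logprob_model(in_domain)
gen_lp = unigram_logprob_model(general)

# Keep only the fraction of the pool that looks most in-domain (here, the top half).
ranked = sorted(monolingual_pool, key=lambda s: score(s, in_lp, gen_lp))
selected = ranked[: len(ranked) // 2]
print(selected)
```

The selected subset would then be translated by the current model and the resulting synthetic pairs added to the parallel training data, rather than translating the entire pool.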