Weston Feely

Inference-only sub-character decomposition improves translation of unseen logographic characters

Nov 12, 2020

Danielle Saunders, Weston Feely, Bill Byrne

Abstract: Neural Machine Translation (NMT) on logographic source languages struggles when translating "unseen" characters, which never appear in the training data. One possible approach to this problem uses sub-character decomposition for training and test sentences. However, this approach involves complete retraining, and its effectiveness for unseen character translation to non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally. We offer a simple alternative based on decomposition before inference for unseen characters only. Our approach allows flexible application, achieving translation adequacy improvements and requiring no additional models or training.
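As a rough illustration of "decomposition before inference for unseen characters only", here is a minimal Python sketch. It is not the authors' implementation: the function names `build_char_vocab` and `decompose_unseen` and the two-entry `TOY_DECOMPOSITIONS` table are hypothetical, and the sketch assumes that a character counts as "seen" if it occurs anywhere in the training source text and that decompositions are written as ideographic description sequences. A real system would load a full decomposition resource and would probably also need to check whether the resulting components are themselves known to the model, a detail the abstract does not cover.

```python
from typing import Dict, Iterable, Set

# Toy decomposition table (assumption, for illustration only):
# character -> ideographic description sequence of its components.
TOY_DECOMPOSITIONS: Dict[str, str] = {
    "明": "⿰日月",
    "好": "⿰女子",
}


def build_char_vocab(training_sentences: Iterable[str]) -> Set[str]:
    """Collect every character that occurs in the training source sentences."""
    return {ch for sentence in training_sentences for ch in sentence}


def decompose_unseen(sentence: str, seen: Set[str], table: Dict[str, str]) -> str:
    """Replace only characters that are unseen AND have a table entry.

    Seen characters, and unseen characters without a decomposition,
    are passed through unchanged.
    """
    pieces = []
    for ch in sentence:
        if ch not in seen and ch in table:
            pieces.append(table[ch])
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    # The toy "training data" contains the components but not 明 itself.
    seen_chars = build_char_vocab(["日月女子"])
    print(decompose_unseen("明日", seen_chars, TOY_DECOMPOSITIONS))  # ⿰日月日
```

Because the rewriting happens only on the input at inference time, the translation model itself stays untouched, which matches the abstract's claim that no additional models or training are required.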

* Workshop on Asian Translation (WAT) 2020
