Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Jul 31, 2023

Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi, Eiichiro Sumita

Figure 1 for SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Figure 2 for SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Figure 3 for SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Figure 4 for SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Share this with someone who'll enjoy it:

Abstract:Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE), however, they are inefficient as they require parallel corpora, days to train and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle- and high-resource scenarios, where we compare the performance of using different segmentation methods. The experimental results demonstrate that on the low-resource ALT dataset, our method achieves more than 1.2 BLEU score improvement compared with BPE and SentencePiece, and a 1.1 score improvement over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT) on average. The regularization method achieves approximately a 4.3 BLEU score improvement over BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version of BPE. We also observed significant improvements on IWSLT15 Vi->En, WMT16 Ro->En and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En and WMT14 Fr->En datasets.

* Accepted to TALLIP journal

View paper on

Share this with someone who'll enjoy it:

Title:SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Paper and Code