James Hadley

Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Nov 28, 2020

Alberto Poncelas, Jan Buts, James Hadley, Andy Way

Abstract: Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages. When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration. In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto. As an additional contribution, we made available a set of English-Esperanto parallel data in the literary domain.

* The 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT 2020)
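
The idea in this abstract, reusing the same sentence pairs with different subword splits, can be sketched as follows. This is a minimal toy illustration and not the authors' pipeline: the BPE learner is a simplified one written for this example, the function names (`learn_bpe`, `segment`, `augment`) are hypothetical, and re-segmenting only the source side is an assumption made here for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules (simplified BPE, toy implementation)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in sentences:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(sentence, merges):
    """Split a sentence into subwords by applying the merge rules in learned order."""
    pieces = []
    for word in sentence.split():
        symbols = tuple(word) + ("</w>",)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        pieces.extend(symbols)
    return " ".join(pieces)


def augment(parallel, merge_counts):
    """Replicate each (source, target) pair once per merge-operation setting.

    Assumption for this sketch: only the source side is re-segmented and the
    target side is kept unchanged.
    """
    sources = [src for src, _ in parallel]
    augmented = []
    for n in merge_counts:
        merges = learn_bpe(sources, n)
        for src, tgt in parallel:
            augmented.append((segment(src, merges), tgt))
    return augmented
```

Calling `augment(parallel, [0, 2])` on a list of two sentence pairs returns four pairs: the same sentences appear twice, once split into characters and once with the two most frequent merges applied. In practice the merge counts would be in the thousands and an established BPE toolkit would replace the toy learner.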

The Impact of Indirect Machine Translation on Sentiment Classification

Aug 25, 2020

Alberto Poncelas, Pintu Lohar, Andy Way, James Hadley

Abstract: Sentiment classification has been crucial for many natural language processing (NLP) applications, such as the analysis of movie reviews, tweets, or customer feedback. A sufficiently large amount of data is required to build a robust sentiment classification system. However, such resources are not always available for all domains or for all languages. In this work, we propose employing a machine translation (MT) system to translate customer feedback into another language to investigate in which cases translated sentences can have a positive or negative impact on an automatic sentiment classifier. Furthermore, as performing a direct translation is not always possible, we explore the performance of automatic classifiers on sentences that have been translated using a pivot MT system. We conduct several experiments using the above approaches to analyse the performance of our proposed sentiment classification system and discuss the advantages and drawbacks of classifying translated sentences.

* Proceedings of Association for Machine Translation in the Americas, AMTA (2020)
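
Pivot (indirect) translation, as mentioned in this abstract, chains two systems: source to pivot language, then pivot to target language, with the classifier applied to the final output. The sketch below only illustrates that composition. The word-by-word "translators", the lexicon classifier, and all names and word tables are hypothetical stand-ins invented for this example, not the systems used in the paper.

```python
from typing import Callable, Dict, Set

Translator = Callable[[str], str]


def word_by_word(table: Dict[str, str]) -> Translator:
    """Toy stand-in for an MT system: looks up each word, copies unknown words."""
    def translate(sentence: str) -> str:
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate


def pivot(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Compose two translators so that text passes through a pivot language."""
    def translate(sentence: str) -> str:
        return pivot_to_tgt(src_to_pivot(sentence))
    return translate


def classify(sentence: str, positive: Set[str], negative: Set[str]) -> str:
    """Toy lexicon-based sentiment classifier (assumption, for illustration only)."""
    tokens = sentence.split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical toy word tables: Spanish -> French (pivot) -> English.
es_to_fr = word_by_word({"el": "le", "servicio": "service", "es": "est", "excelente": "excellent"})
fr_to_en = word_by_word({"le": "the", "est": "is"})
es_to_en_via_fr = pivot(es_to_fr, fr_to_en)

translated = es_to_en_via_fr("el servicio es excelente")
label = classify(translated, positive={"excellent"}, negative={"terrible"})
```

Errors made in the first step are passed on to the second, which is why the effect of pivoting on a downstream classifier is worth measuring separately from direct translation.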

Multiple Segmentations of Thai Sentences for Neural Machine Translation

Apr 23, 2020

Alberto Poncelas, Wichaya Pidchamook, Chao-Hong Liu, James Hadley, Andy Way

Abstract: Thai is a low-resource language, so it is often the case that data are not available in sufficient quantities to train a Neural Machine Translation (NMT) model that performs to a high level of quality. In addition, the Thai script does not use white spaces to delimit the boundaries between words, which adds more complexity when building sequence-to-sequence models. In this work, we explore how to augment a set of English-Thai parallel data by replicating sentence pairs with different word segmentation methods on Thai, as training data for NMT models. Using different merge operations of Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that combining these datasets improves the performance of NMT models trained with a dataset that has been split using a supervised splitting tool.

* Spoken Language Technologies for Under-resourced languages and CCURL Collaboration and Computing for Under-Resourced Languages Workshop, SLTU-CCURL (2020)

A Tool for Facilitating OCR Postediting in Historical Documents

Apr 23, 2020

Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way

Abstract: Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.

* Workshop on Language Technologies for Historical and Ancient Languages, LT4HALA (2020)
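
The workflow described in this abstract (flag word forms missing from a vocabulary, propose alternatives, choose among them with a language model) can be sketched as below. This is a rough sketch under stated assumptions, not the tool itself: candidate generation via `difflib.get_close_matches` and the add-one smoothed bigram model are choices made for this example, and the function names are hypothetical.

```python
import difflib
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function giving add-one smoothed bigram log-probabilities."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def logprob(prev, word):
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

    return logprob


def postedit(line, vocabulary, logprob, n_candidates=3):
    """Replace out-of-vocabulary tokens with the candidate the LM scores highest.

    Returns the corrected line and a list of (original, replacement) pairs so
    that every change stays visible to a human reviewer.
    """
    known = sorted(vocabulary)  # stable order for candidate generation
    corrected, changes = [], []
    prev = "<s>"
    for token in line.split():
        best = token
        if token not in vocabulary:
            # Assumption: spelling-similar vocabulary words serve as candidates.
            candidates = difflib.get_close_matches(token, known, n=n_candidates, cutoff=0.6)
            if candidates:
                best = max(candidates, key=lambda c: logprob(prev, c))
            if best != token:
                changes.append((token, best))
        corrected.append(best)
        prev = best
    return " ".join(corrected), changes
```

With a model trained on `["the trade of the kingdom", "the poor of the kingdom"]` and the vocabulary drawn from those sentences, `postedit("the tradc of the kingdom", vocabulary, logprob)` returns the corrected line together with the single change `("tradc", "trade")`. Tokens with no sufficiently similar vocabulary entry are left as they are.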
