Jan Buts

Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Nov 28, 2020

Alberto Poncelas, Jan Buts, James Hadley, Andy Way

Abstract: Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages. When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration. In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto. As an additional contribution, we made available a set of English-Esperanto parallel data in the literary domain.

* The 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT 2020)
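
The idea in the abstract above, reusing the same sentence pairs under several subword segmentations, can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the tiny BPE learner, the `@@ ` continuation marker, the function names (`learn_bpe`, `apply_bpe`, `multi_bpe_corpus`), the choice to train separate source and target models, and the toy sentences are all assumptions made for the example.

```python
from collections import Counter

END = "</w>"  # end-of-word marker used while learning merges (an assumption)

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(sentences, num_merges):
    """Learn at most `num_merges` merge rules from whitespace-tokenized text."""
    vocab = Counter(tuple(w) + (END,) for s in sentences for w in s.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges

def apply_bpe(sentence, merges):
    """Segment a sentence with learned merges, marking non-final pieces with '@@'."""
    words = []
    for word in sentence.split():
        symbols = tuple(word) + (END,)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        pieces = [s.replace(END, "") for s in symbols]
        words.append("@@ ".join(p for p in pieces if p))
    return " ".join(words)

def multi_bpe_corpus(src_sents, tgt_sents, merge_counts):
    """Repeat the parallel corpus once per merge count, re-segmenting each time."""
    augmented = []
    for n in merge_counts:
        src_merges = learn_bpe(src_sents, n)
        tgt_merges = learn_bpe(tgt_sents, n)
        for s, t in zip(src_sents, tgt_sents):
            augmented.append((apply_bpe(s, src_merges), apply_bpe(t, tgt_merges)))
    return augmented

if __name__ == "__main__":
    # Toy sentences invented for illustration; not taken from the released data.
    english = ["the cat sleeps", "the dog sleeps"]
    esperanto = ["la kato dormas", "la hundo dormas"]
    for src, tgt in multi_bpe_corpus(english, esperanto, [0, 3, 20]):
        print(src, "|||", tgt)
```

Each merge count yields a different split of the same words, so a corpus of N sentence pairs becomes N times the number of merge settings. Removing the `@@ ` markers recovers the original sentences, so no new lexical content is introduced, only alternative segmentations.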

A Tool for Facilitating OCR Postediting in Historical Documents

Apr 23, 2020

Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way

Abstract: Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.

* Workshop on Language Technologies for Historical and Ancient Languages, LT4HALA (2020)
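
The workflow in the abstract above (flag word forms missing from a vocabulary, propose alternatives, pick one by language model score) might look roughly like the sketch below. It is not the tool's implementation: generating candidates by a single character edit, scoring with an add-one smoothed bigram model, and all function names and sample sentences are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams)))

def edits1(word, alphabet):
    """All strings one deletion, substitution, or insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | replaces | inserts

def postedit(line, vocab, unigrams, bigrams):
    """Return the corrected line plus a log of (original, chosen, candidates)."""
    alphabet = set("".join(vocab))
    corrected, log = [], []
    prev = "<s>"
    for token in line.split():
        if token not in vocab:
            candidates = sorted(edits1(token, alphabet) & vocab)
            if candidates:
                best = max(
                    candidates,
                    key=lambda c: bigram_logprob(prev, c, unigrams, bigrams),
                )
                # Keep every candidate so a human can override the choice.
                log.append((token, best, candidates))
                token = best
        corrected.append(token)
        prev = token
    return " ".join(corrected), log

if __name__ == "__main__":
    # Invented training text and OCR line, for illustration only.
    clean_text = [
        "the trade of this kingdom",
        "the poor of this kingdom",
        "employing the poor",
        "open the door",
    ]
    vocab = {w for s in clean_text for w in s.split()}
    unigrams, bigrams = train_bigram_lm(clean_text)
    fixed, log = postedit("the toor of thif kingdom", vocab, unigrams, bigrams)
    print(fixed)
    for original, chosen, candidates in log:
        print(f"{original} -> {chosen} (candidates: {candidates})")
```

Returning the candidate log alongside the corrected line reflects the abstract's point that the output should stay transparent and open to human intervention: a reviewer can see what was changed and which alternatives were considered.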
