Sourabh Deoghare

Giving the Old a Fresh Spin: Quality Estimation-Assisted Constrained Decoding for Automatic Post-Editing

Jan 28, 2025

Sourabh Deoghare, Diptesh Kanojia, Pushpak Bhattacharyya

Abstract: Automatic Post-Editing (APE) systems often struggle with over-correction, where unnecessary modifications are made to a translation, diverging from the principle of minimal editing. In this paper, we propose a novel technique to mitigate over-correction by incorporating word-level Quality Estimation (QE) information during the decoding process. This method is architecture-agnostic, making it adaptable to any APE system, regardless of the underlying model or training approach. Our experiments on English-German, English-Hindi, and English-Marathi language pairs show the proposed approach yields significant improvements over their corresponding baseline APE systems, with TER gains of $0.65$, $1.86$, and $1.44$ points, respectively. These results underscore the complementary relationship between QE and APE tasks and highlight the effectiveness of integrating QE information to reduce over-correction in APE systems.

* Accepted to NAACL 2025 Main Conference: Short Papers
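
The abstract above does not spell out how the QE information constrains decoding, so the following is only a toy illustration of the general idea, not the authors' algorithm. It assumes word-level QE tags of "OK" or "BAD" for each MT token and an n-best list of scored APE hypotheses, and it picks the best-scoring hypothesis that keeps every OK-tagged word in order. The function name `pick_constrained_hypothesis`, the tag labels, and the example data are all hypothetical.

```python
from typing import List, Sequence, Tuple


def pick_constrained_hypothesis(
    mt_tokens: Sequence[str],
    qe_tags: Sequence[str],
    hypotheses: Sequence[Tuple[float, List[str]]],
) -> List[str]:
    """Pick the best-scoring hypothesis that preserves all OK-tagged MT words.

    Assumption: qe_tags[i] is "OK" or "BAD" for mt_tokens[i], and each
    hypothesis is a (model_score, tokens) pair where higher scores are better.
    """
    # Words the QE model considers correct must survive post-editing.
    required = [tok for tok, tag in zip(mt_tokens, qe_tags) if tag == "OK"]

    def keeps_required(hyp_tokens: Sequence[str]) -> bool:
        # Subsequence check: every required word appears, in the same order.
        remaining = iter(hyp_tokens)
        return all(tok in remaining for tok in required)

    valid = [(score, hyp) for score, hyp in hypotheses if keeps_required(hyp)]
    if not valid:
        # No hypothesis respects the constraint: fall back to the unedited MT.
        return list(mt_tokens)
    return max(valid, key=lambda item: item[0])[1]


# Toy example: the top-scoring hypothesis rewrites OK-tagged words, so it is rejected.
mt = ["das", "ist", "ein", "Haus"]
tags = ["OK", "OK", "BAD", "OK"]
nbest = [
    (-0.1, ["es", "ist", "das", "Heim"]),   # over-corrects "das" and "Haus"
    (-0.4, ["das", "ist", "das", "Haus"]),  # only touches the BAD-tagged word
]
chosen = pick_constrained_hypothesis(mt, tags, nbest)
```

Filtering an n-best list is a simplification chosen to keep the sketch short; a constraint applied during decoding, as the abstract describes, would act on partial hypotheses inside the search itself.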

Together We Can: Multilingual Automatic Post-Editing for Low-Resource Languages

Oct 23, 2024

Sourabh Deoghare, Diptesh Kanojia, Pushpak Bhattacharyya

Abstract: This exploratory study investigates the potential of multilingual Automatic Post-Editing (APE) systems to enhance the quality of machine translations for low-resource Indo-Aryan languages. Focusing on two closely related language pairs, English-Marathi and English-Hindi, we exploit the linguistic similarities to develop a robust multilingual APE model. To facilitate cross-linguistic transfer, we generate synthetic Hindi-Marathi and Marathi-Hindi APE triplets. Additionally, we incorporate a Quality Estimation (QE)-APE multi-task learning framework. While the experimental results underline the complementary nature of APE and QE, we also observe that QE-APE multitask learning facilitates effective domain adaptation. Our experiments demonstrate that the multilingual APE models outperform their corresponding English-Hindi and English-Marathi single-pair models by $2.5$ and $2.39$ TER points, respectively, with further notable improvements over the multilingual APE model observed through multi-task learning ($+1.29$ and $+1.44$ TER points), data augmentation ($+0.53$ and $+0.45$ TER points) and domain adaptation ($+0.35$ and $+0.45$ TER points). We release the synthetic data, code, and models accrued during this study publicly at https://github.com/cfiltnlp/Multilingual-APE.

* Accepted at Findings of EMNLP 2024
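
The results above are reported in TER (Translation Edit Rate) points. As a rough aid for reading those numbers, here is a simplified sketch of the metric: the word-level edit distance between a hypothesis and a reference, divided by the reference length. Full TER also allows block shifts at a cost of one edit each, which this sketch omits, so it can only overestimate the true score. The function name `simple_ter` is hypothetical and this is not the evaluation script used in the paper.

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length (no shift operation)."""
    hyp, ref = hypothesis.split(), reference.split()

    # Standard Levenshtein dynamic programme over words, one row at a time.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = prev[j - 1] + (0 if hyp_word == ref_word else 1)
            deletion = prev[j] + 1      # drop a hypothesis word
            insertion = curr[j - 1] + 1  # add a missing reference word
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    # Guard against an empty reference to avoid dividing by zero.
    return prev[-1] / max(len(ref), 1)


# One missing word against a six-word reference gives 1/6.
score = simple_ter("the cat sat on mat", "the cat sat on the mat")
```

Lower is better: a drop of one TER point means roughly one fewer edit per hundred reference words is needed to turn the system output into the reference.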

APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT Training Data Creation

Dec 18, 2023

Akshay Batheja, Sourabh Deoghare, Diptesh Kanojia, Pushpak Bhattacharyya

Abstract: Automatic Post-Editing (APE) is the task of automatically identifying and correcting errors in the Machine Translation (MT) outputs. We propose a repair-filter-use methodology that uses an APE system to correct errors on the target side of the MT training data. We select the sentence pairs from the original and corrected sentence pairs based on the quality scores computed using a Quality Estimation (QE) model. To the best of our knowledge, this is a novel adaptation of APE and QE to extract quality parallel corpus from the pseudo-parallel corpus. By training with this filtered corpus, we observe an improvement in the Machine Translation system's performance by 5.64 and 9.91 BLEU points, for English-Marathi and Marathi-English, over the baseline model. The baseline model is the one that is trained on the whole pseudo-parallel corpus. Our work is not limited by the characteristics of English or Marathi languages; and is language pair-agnostic, given the necessary QE and APE data.

* arXiv admin note: text overlap with arXiv:2306.03507
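
The repair-filter-use idea described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `ape` and `qe_score` are hypothetical callables standing in for trained APE and QE models, and the rule used here (keep whichever target scores higher, then drop pairs below a threshold) is one plausible reading of "select ... based on the quality scores". The threshold value and toy data are invented for the example.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def repair_filter(
    pairs: Iterable[Pair],
    ape: Callable[[str, str], str],
    qe_score: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Repair each target with APE, then keep the better-scoring version if good enough."""
    kept: List[Pair] = []
    for src, tgt in pairs:
        corrected = ape(src, tgt)  # repair step
        candidates = [
            (qe_score(src, tgt), tgt),              # original target
            (qe_score(src, corrected), corrected),  # post-edited target
        ]
        best_score, best_tgt = max(candidates, key=lambda c: c[0])
        if best_score >= threshold:  # filter step
            kept.append((src, best_tgt))
    return kept  # "use": train the MT system on these pairs


# Toy stand-ins (assumptions): APE fixes one typo, QE penalises known bad strings.
def toy_ape(src: str, tgt: str) -> str:
    return tgt.replace("teh", "the")


def toy_qe(src: str, tgt: str) -> float:
    if "teh" in tgt:
        return 0.2
    if "xx" in tgt:
        return 0.1
    return 0.9


corpus = [("a", "teh cat"), ("b", "good"), ("c", "xx")]
kept = repair_filter(corpus, toy_ape, toy_qe, threshold=0.5)
```

In the toy run the first pair is kept in its repaired form, the second is kept unchanged, and the third is dropped because neither version reaches the threshold.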

VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages

May 21, 2023

Shivam Mhaskar, Vineet Bhat, Akshay Batheja, Sourabh Deoghare, Paramveer Choudhary, Pushpak Bhattacharyya

Abstract: In this work, we present our deployment-ready Speech-to-Speech Machine Translation (SSMT) system for English-Hindi, English-Marathi, and Hindi-Marathi language pairs. We develop the SSMT system by cascading Automatic Speech Recognition (ASR), Disfluency Correction (DC), Machine Translation (MT), and Text-to-Speech Synthesis (TTS) models. We discuss the challenges faced during the research and development stage and the scalable deployment of the SSMT system as a publicly accessible web service. On the MT part of the pipeline too, we create a Text-to-Text Machine Translation (TTMT) service in all six translation directions involving English, Hindi, and Marathi. To mitigate data scarcity, we develop a LaBSE-based corpus filtering tool to select high-quality parallel sentences from a noisy pseudo-parallel corpus for training the TTMT system. All the data used for training the SSMT and TTMT systems and the best models are being made publicly available. Users of our system are (a) Govt. of India in the context of its new education policy (NEP), (b) tourists who criss-cross the multilingual landscape of India, (c) Indian Judiciary where a leading cause of the pendency of cases (to the order of 10 million as on date) is the translation of case papers, (d) farmers who need weather and price information and so on. We also share the feedback received from various stakeholders when our SSMT and TTMT systems were demonstrated in large public events.
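
The corpus filtering step mentioned in the abstract above can be illustrated with a small sketch. It assumes the common approach of embedding both sides of a sentence pair with a multilingual encoder such as LaBSE and keeping pairs whose cosine similarity clears a threshold. The abstract does not describe the tool's internals, so this is an assumption rather than the authors' implementation. The `embed` callable, the `0.8` threshold, and the toy vectors are hypothetical; in practice `embed` would wrap a real sentence encoder.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_similarity(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off, to be tuned on held-out data
) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose source and target embeddings are close."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if cosine(embed(src), embed(tgt)) >= threshold
    ]


# Toy embedding table standing in for a multilingual sentence encoder.
toy_vectors = {
    "hello": [1.0, 0.0],
    "namaste": [1.0, 0.0],   # pretend translation: identical direction
    "banana": [0.0, 1.0],    # unrelated sentence: orthogonal direction
}
noisy = [("hello", "namaste"), ("hello", "banana")]
clean = filter_by_similarity(noisy, toy_vectors.__getitem__)
```

Passing the encoder in as a callable keeps the filtering logic independent of any particular embedding library.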
