Alexis Mathey

Robust Domain Adaptation for Pre-trained Multilingual Neural Machine Translation Models

Oct 26, 2022

Mathieu Grosso, Pirashanth Ratnamogan, Alexis Mathey, William Vanhuffel, Michael Fotso Fotso

Abstract: Recent literature has demonstrated the potential of multilingual Neural Machine Translation (mNMT) models. However, the most efficient models are not well suited to specialized industries. In these cases, internal data is scarce and expensive to find in all language pairs. Therefore, fine-tuning an mNMT model on a specialized domain is hard. In this context, we focus on a new task: domain adaptation of a pre-trained mNMT model on a single language pair while trying to maintain model quality on generic domain data for all language pairs. The risk of degradation on the generic domain and on other pairs is high. This task is key for mNMT model adoption in industry and sits at the border of many other tasks. We propose a fine-tuning procedure for the generic mNMT model that combines embedding freezing and an adversarial loss. Our experiments demonstrate that, compared to a naive standard approach, the procedure improves performance on specialized data with minimal loss in initial performance on the generic domain for all language pairs (+10.0 BLEU score on specialized data, -0.01 to -0.5 BLEU on WMT and Tatoeba datasets on the other pairs with M2M100).

* Accepted by EMNLP 2022 MMNLU Workshop

Information Extraction from Visually Rich Documents with Font Style Embeddings

Nov 07, 2021

Ismail Oussaid, William Vanhuffel, Pirashanth Ratnamogan, Mhamed Hajaiej, Alexis Mathey, Thomas Gilles

Abstract: Information extraction (IE) from documents is an intensive area of research with a large set of industrial applications. Current state-of-the-art methods focus on scanned documents with approaches combining computer vision, natural language processing and layout representation. We propose to challenge the usage of computer vision in the case where both token style and visual representation are available (i.e., native PDF documents). Our experiments on three real-world complex datasets demonstrate that using an embedding based on token style attributes instead of a raw visual embedding in the LayoutLM model is beneficial. Depending on the dataset, such an embedding yields an improvement of 0.18% to 2.29% in the weighted F1-score with a decrease of 30.7% in the final number of trainable parameters of the model, leading to an improvement in both efficiency and effectiveness.
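
For readers unfamiliar with the metric, the weighted F1-score mentioned in this abstract is commonly computed as the average of per-class F1 scores, each weighted by the number of true instances of that class. The sketch below assumes that common definition; the abstract does not spell out the exact formula the authors used, and the function name `weighted_f1` and the token labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumption: this follows the usual "weighted" averaging convention;
    it is not taken from the paper itself.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0

    for label, n_true in support.items():
        # True positives: positions where both truth and prediction equal label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)

        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        # Each class contributes in proportion to its share of true labels.
        score += f1 * n_true / total

    return score


# Hypothetical token labels for a small extraction example.
truth = ["TOTAL", "TOTAL", "DATE", "O"]
preds = ["TOTAL", "DATE", "DATE", "O"]
print(weighted_f1(truth, preds))
```

Because frequent classes dominate the average, a small gain in weighted F1 can hide larger changes on rare field types, which is worth keeping in mind when reading the reported 0.18% to 2.29% range.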
