Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yoshinari Fujinuma

M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Jun 12, 2024

Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari

Figure 1 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Figure 2 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Figure 3 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Figure 4 for M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Abstract:Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications.

* NAACL 2024, dataset at https://github.com/amazon-science/m3t-multi-modal-translation-bench

Via

Access Paper or Ask Questions

A Multi-Modal Multilingual Benchmark for Document Image Classification

Oct 25, 2023

Yoshinari Fujinuma, Siddharth Varia, Nishant Sankaran, Srikar Appalaraju, Bonan Min, Yogarshi Vyas

Figure 1 for A Multi-Modal Multilingual Benchmark for Document Image Classification

Figure 2 for A Multi-Modal Multilingual Benchmark for Document Image Classification

Figure 3 for A Multi-Modal Multilingual Benchmark for Document Image Classification

Figure 4 for A Multi-Modal Multilingual Benchmark for Document Image Classification

Abstract:Document image classification is different from plain-text document classification and consists of classifying a document by understanding the content and structure of documents such as forms, emails, and other such documents. We show that the only existing dataset for this task (Lewis et al., 2006) has several limitations and we introduce two newly curated multilingual datasets WIKI-DOC and MULTIEURLEX-DOC that overcome these limitations. We further undertake a comprehensive study of popular visually-rich document understanding or Document AI models in previously untested setting in document image classification such as 1) multi-label classification, and 2) zero-shot cross-lingual transfer setup. Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages. Our datasets and findings open the door for future research into improving Document AI models.

* Accepted to EMNLP 2023 (Findings)

Via

Access Paper or Ask Questions

Diable: Efficient Dialogue State Tracking as Operations on Tables

May 26, 2023

Pietro Lesci, Yoshinari Fujinuma, Momchil Hardalov, Chao Shang, Lluis Marquez

Abstract:Sequence-to-sequence state-of-the-art systems for dialogue state tracking (DST) use the full dialogue history as input, represent the current state as a list with all the slots, and generate the entire state from scratch at each dialogue turn. This approach is inefficient, especially when the number of slots is large and the conversation is long. In this paper, we propose Diable, a new task formalisation that simplifies the design and implementation of efficient DST systems and allows one to easily plug and play large language models. We represent the dialogue state as a table and formalise DST as a table manipulation task. At each turn, the system updates the previous state by generating table operations based on the dialogue context. Extensive experimentation on the MultiWoz datasets demonstrates that Diable (i) outperforms strong efficient DST baselines, (ii) is 2.4x more time efficient than current state-of-the-art methods while retaining competitive Joint Goal Accuracy, and (iii) is robust to noisy data annotations due to the table operations approach.

* Accepted to ACL 2023 (Findings)

Via

Access Paper or Ask Questions

Comparing Biases and the Impact of Multilingual Training across Multiple Languages

May 18, 2023

Sharon Levy, Neha Anna John, Ling Liu, Yogarshi Vyas, Jie Ma, Yoshinari Fujinuma, Miguel Ballesteros, Vittorio Castelli, Dan Roth

Figure 1 for Comparing Biases and the Impact of Multilingual Training across Multiple Languages

Figure 2 for Comparing Biases and the Impact of Multilingual Training across Multiple Languages

Figure 3 for Comparing Biases and the Impact of Multilingual Training across Multiple Languages

Figure 4 for Comparing Biases and the Impact of Multilingual Training across Multiple Languages

Abstract:Studies in bias and fairness in natural language processing have primarily examined social biases within a single language and/or across few attributes (e.g. gender, race). However, biases can manifest differently across various languages for individual attributes. As a result, it is critical to examine biases within each language and attribute. Of equal importance is to study how these biases compare across languages and how the biases are affected when training a model on multilingual data versus monolingual data. We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task to observe whether specific demographics are viewed more positively. We study bias similarities and differences across these languages and investigate the impact of multilingual vs. monolingual training data. We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender. Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture (e.g. majority religions and nationalities). Additionally, we find an increased variation in predictions across protected groups, indicating bias amplification, after multilingual finetuning in comparison to multilingual pretraining.

Via

Access Paper or Ask Questions

Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

Mar 21, 2022

Yoshinari Fujinuma, Jordan Boyd-Graber, Katharina Kann

Figure 1 for Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

Figure 2 for Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

Figure 3 for Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

Figure 4 for Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

Abstract:Pretrained multilingual models enable zero-shot learning even for unseen languages, and that performance can be further improved via adaptation prior to finetuning. However, it is unclear how the number of pretraining languages influences a model's zero-shot learning for languages unseen during pretraining. To fill this gap, we ask the following research questions: (1) How does the number of pretraining languages influence zero-shot performance on unseen target languages? (2) Does the answer to that question change with model adaptation? (3) Do the findings for our first question change if the languages used for pretraining are all related? Our experiments on pretraining with related languages indicate that choosing a diverse set of languages is crucial. Without model adaptation, surprisingly, increasing the number of pretraining languages yields better results up to adding related languages, after which performance plateaus. In contrast, with model adaptation via continued pretraining, pretraining on a larger number of languages often gives further improvement, suggesting that model adaptation is crucial to exploit additional pretraining languages.

* ACL 2022 camera ready

Via

Access Paper or Ask Questions

Semi-Supervised Joint Estimation of Word and Document Readability

Apr 27, 2021

Yoshinari Fujinuma, Masato Hagiwara

Figure 1 for Semi-Supervised Joint Estimation of Word and Document Readability

Figure 2 for Semi-Supervised Joint Estimation of Word and Document Readability

Figure 3 for Semi-Supervised Joint Estimation of Word and Document Readability

Figure 4 for Semi-Supervised Joint Estimation of Word and Document Readability

Abstract:Readability or difficulty estimation of words and documents has been investigated independently in the literature, often assuming the existence of extensive annotated resources for the other. Motivated by our analysis showing that there is a recursive relationship between word and document difficulty, we propose to jointly estimate word and document difficulty through a graph convolutional network (GCN) in a semi-supervised fashion. Our experimental results reveal that the GCN-based method can achieve higher accuracy than strong baselines, and stays robust even with a smaller amount of labeled data.

Via

Access Paper or Ask Questions

Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

May 01, 2020

Mozhi Zhang, Yoshinari Fujinuma, Michael J. Paul, Jordan Boyd-Graber

Figure 1 for Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

Figure 2 for Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

Figure 3 for Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

Abstract:Cross-lingual word embeddings (CLWE) are often evaluated on bilingual lexicon induction (BLI). Recent CLWE methods use linear projections, which underfit the training dictionary, to generalize on BLI. However, underfitting can hinder generalization to other downstream tasks that rely on words from the training dictionary. We address this limitation by retrofitting CLWE to the training dictionary, which pulls training translation pairs closer in the embedding space and overfits the training dictionary. This simple post-processing step often improves accuracy on two downstream tasks, despite lowering BLI test accuracy. We also retrofit to both the training dictionary and a synthetic dictionary induced from CLWE, which sometimes generalizes even better on downstream tasks. Our results confirm the importance of fully exploiting training dictionary in downstream tasks and explains why BLI is a flawed CLWE evaluation.

* ACL 2020

Via

Access Paper or Ask Questions

A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Jun 05, 2019

Yoshinari Fujinuma, Jordan Boyd-Graber, Michael J. Paul

Figure 1 for A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Figure 2 for A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Figure 3 for A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Figure 4 for A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Abstract:Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language - i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a network measurement that measures the strength of clusters in a graph. Modularity has a moderate to strong correlation with three downstream tasks, even though modularity is based only on the structure of embeddings and does not require any external resources. We show through experiments that modularity can serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in low-resource settings.

* Accepted to ACL 2019, camera-ready

Via

Access Paper or Ask Questions

Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Dec 22, 2018

Mozhi Zhang, Yoshinari Fujinuma, Jordan Boyd-Graber

Figure 1 for Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Figure 2 for Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Figure 3 for Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Figure 4 for Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Abstract:Text classification must sometimes be applied in situations with no training data in a target language. However, training data may be available in a related language. We introduce a cross-lingual document classification framework (CACO) between related language pairs. To best use limited training data, our transfer learning scheme exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. CACO models trained under low-resource settings rival cross-lingual word embedding models trained under high-resource settings on related language pairs.

Via

Access Paper or Ask Questions