Abstract:A good evaluation framework should evaluate multimodal machine translation (MMT) models by measuring 1) their use of visual information to aid in the translation task and 2) their ability to translate complex sentences, as is done for text-only machine translation. However, most current work in MMT is evaluated against the Multi30k testing sets, which do not measure these properties. Namely, the use of visual information by the MMT model cannot be shown directly from the Multi30k test set results, and the sentences in Multi30k are image captions, i.e., short, descriptive sentences, as opposed to the complex sentences that typical text-only machine translation models are evaluated against. Therefore, we propose that MMT models be evaluated using 1) the CoMMuTE evaluation framework, which measures the use of visual information by MMT models, 2) the text-only WMT news translation task test sets, which evaluate translation performance against complex sentences, and 3) the Multi30k test sets, which measure MMT model performance against a real MMT dataset. Finally, we evaluate recent MMT models trained solely on the Multi30k dataset against our proposed evaluation framework and demonstrate the dramatic drop in performance on text-only testing sets compared to recent text-only MT models.
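A minimal sketch of how the proposed three-part evaluation could be wired together, assuming a `translate_fn(src, image)` that returns a hypothesis, a `score_fn(src, image, candidate)` that returns a lower-is-better score (e.g., perplexity), and simple list-of-dict test sets; the function names and data formats are illustrative assumptions, not part of the framework itself.

```python
import sacrebleu


def evaluate_mmt(translate_fn, score_fn, multi30k, wmt, commute):
    """Three-part MMT evaluation sketch: BLEU on Multi30k and text-only WMT,
    plus contrastive accuracy on CoMMuTE-style image-disambiguated pairs."""
    results = {}

    # BLEU on the Multi30k test sets and the text-only WMT newstest sets.
    for name, data in [("multi30k", multi30k), ("wmt", wmt)]:
        hyps = [translate_fn(ex["src"], ex.get("image")) for ex in data]
        refs = [ex["ref"] for ex in data]
        results[name + "_bleu"] = sacrebleu.corpus_bleu(hyps, [refs]).score

    # CoMMuTE-style scoring: each ambiguous source comes with an image and two
    # candidate translations; the model should prefer the one matching the image.
    correct = 0
    for ex in commute:
        good = score_fn(ex["src"], ex["image"], ex["correct"])
        bad = score_fn(ex["src"], ex["image"], ex["incorrect"])
        correct += int(good < bad)
    results["commute_accuracy"] = correct / len(commute)
    return results
```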
Abstract:The challenge of visual grounding and masking in multimodal machine translation (MMT) systems has encouraged varying approaches to the detection and selection of visually-grounded text tokens for masking. We introduce new methods for detection of visually and contextually relevant (concrete) tokens from source sentences, including detection with natural language processing (NLP), detection with object detection, and a joint detection-verification technique. We also introduce new methods for selection of detected tokens, including shortest $n$ tokens, longest $n$ tokens, and all detected concrete tokens. We utilize the GRAM MMT architecture to train models against synthetically collated multimodal datasets of source images with masked sentences, showing improvements in translation performance and in the use of visual context during translation over the baseline model.
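A hedged sketch of the detection/selection pipeline, using a plain NLTK part-of-speech heuristic (keep nouns) as a stand-in for the NLP- and object-detection-based detectors described above; the selection strategies (shortest $n$, longest $n$, all) follow the abstract, but the noun heuristic itself is an assumption for illustration.

```python
import nltk

# One-time resource downloads (resource names differ slightly across NLTK versions).
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)


def detect_concrete_tokens(sentence):
    """Detect candidate visually-groundable (concrete) tokens with a simple
    POS heuristic: keep nouns, which tend to name depictable objects."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [tok for tok, tag in tagged if tag.startswith("NN")]


def select_tokens(tokens, strategy="all", n=3):
    """Select detected tokens for masking: shortest n, longest n, or all."""
    if strategy == "shortest":
        return sorted(tokens, key=len)[:n]
    if strategy == "longest":
        return sorted(tokens, key=len, reverse=True)[:n]
    return tokens


sentence = "A man in a red jacket rides a bicycle past the bakery."
concrete = detect_concrete_tokens(sentence)
print(select_tokens(concrete, strategy="longest", n=2))
```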
Abstract:While most current work in multimodal machine translation (MMT) uses the Multi30k dataset for training and evaluation, we find that the resulting models overfit to the Multi30k dataset to an extreme degree. Consequently, these models perform very badly when evaluated against typical text-only testing sets such as the WMT newstest datasets. In order to perform well on both Multi30k and typical text-only datasets, we use a performant text-only machine translation (MT) model as the starting point of our MMT model. We add vision-text adapter layers connected via gating mechanisms to the MT model, and incrementally transform the MT model into an MMT model by 1) pre-training using vision-based masking of the source text and 2) fine-tuning on Multi30k.
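A sketch of what a gated vision-text adapter layer might look like in PyTorch, assuming a cross-attention bottleneck adapter whose gate is initialized to zero so the model initially behaves exactly like the underlying text-only MT model; module names and dimensions are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class GatedVisionAdapter(nn.Module):
    """Bottleneck adapter that injects visual features into an MT layer's
    hidden states through cross-attention and a learned scalar gate."""

    def __init__(self, d_model, d_visual, bottleneck=256, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_visual, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        # Gate starts at zero: the adapter contributes nothing until trained.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states, visual_feats):
        vis = self.proj(visual_feats)                      # (B, N_img, d_model)
        attn_out, _ = self.cross_attn(text_states, vis, vis)
        h = self.up(torch.relu(self.down(attn_out)))
        return text_states + torch.tanh(self.gate) * h     # gated residual injection


# Example shapes: 2 sentences of 10 tokens (d_model=512), 49 patch features (d_visual=768).
adapter = GatedVisionAdapter(d_model=512, d_visual=768)
out = adapter(torch.randn(2, 10, 512), torch.randn(2, 49, 768))
print(out.shape)  # torch.Size([2, 10, 512])
```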
Abstract:Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of large pre-trained models for Natural Language Processing and Computer Vision. Recently, we have seen rapid developments in the joint Vision-Language space as well, where pre-trained models such as CLIP (Radford et al., 2021) have demonstrated improvements in downstream tasks like image captioning and visual question answering. However, there is surprisingly little work exploring these models for the task of multimodal machine translation, where the goal is to leverage the image/video modality in text-to-text translation. To fill this gap, this paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation. We summarize the common architectures, pre-training objectives, and datasets from the literature and conjecture what is further needed to make progress on multimodal machine translation.
Abstract:We propose a new Reject Option Classification technique to identify and remove regions of uncertainty in the decision space for a given neural classifier and dataset. Existing formulations employ a learned rejection (remove)/selection (keep) function and require either a known cost for rejecting examples or strong constraints on the accuracy or coverage of the selected examples. We consider an alternative formulation by instead analyzing the complementary reject region and employing a validation set to learn per-class softmax thresholds. The goal is to maximize the accuracy of the selected examples subject to a natural randomness allowance on the rejected examples (rejecting more incorrect than correct predictions). We provide results showing the benefits of the proposed method over naïvely thresholding calibrated/uncalibrated softmax scores with 2-D points, imagery, and text classification datasets using state-of-the-art pretrained models. Source code is available at https://github.com/osu-cvl/learning-idk.
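A simplified sketch of learning per-class softmax thresholds on a validation set, with the abstract's constraint (reject more incorrect than correct predictions) enforced during a grid search; the grid search and tie-breaking are assumptions for illustration, not the authors' algorithm (see the linked repository for their implementation).

```python
import numpy as np


def learn_per_class_thresholds(val_probs, val_labels, grid=np.linspace(0.0, 1.0, 101)):
    """For each class, pick the softmax threshold that maximizes accuracy on the
    kept examples, subject to rejecting more incorrect than correct predictions."""
    preds = val_probs.argmax(axis=1)
    conf = val_probs.max(axis=1)
    thresholds = np.zeros(val_probs.shape[1])
    for c in range(val_probs.shape[1]):
        mask = preds == c
        if not mask.any():
            continue
        correct = val_labels[mask] == c
        conf_c = conf[mask]
        best_t, best_acc = 0.0, 0.0
        for t in grid:
            keep = conf_c >= t
            rejected = ~keep
            # Constraint: among rejected examples, incorrect must outnumber correct.
            if rejected.any() and (rejected & correct).sum() >= (rejected & ~correct).sum():
                continue
            if keep.any() and correct[keep].mean() > best_acc:
                best_acc, best_t = correct[keep].mean(), t
        thresholds[c] = best_t
    return thresholds


def predict_with_rejection(probs, thresholds):
    """Keep a prediction only if its softmax confidence clears its class threshold."""
    preds = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= thresholds[preds]
    return preds, keep
```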
Abstract:Machine translation systems based on deep neural networks are expensive to train. Curriculum learning aims to address this issue by choosing the order in which samples are presented during training to help train better models faster. We adopt a probabilistic view of curriculum learning, which lets us flexibly evaluate the impact of curriculum design, and perform an extensive exploration on a German-English translation task. Results show that it is possible to improve convergence time at no loss in translation quality. However, results are highly sensitive to the choice of sample difficulty criteria, curriculum schedule, and other hyperparameters.
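One way to realize a probabilistic curriculum is to sample from difficulty bins whose weights shift from easy-heavy toward uniform as training progresses; the sketch below illustrates that idea under the assumption of a linear schedule and sentence length as the difficulty criterion, and is not the paper's exact design.

```python
import math
import random


def curriculum_sampler(examples, difficulty, num_steps, num_bins=5, seed=0):
    """Yield one training example per step, drawn from difficulty bins whose
    sampling weights start easy-heavy and flatten to uniform by the last step."""
    rng = random.Random(seed)
    order = sorted(range(len(examples)), key=lambda i: difficulty[i])
    size = math.ceil(len(order) / num_bins)
    bins = [order[k * size:(k + 1) * size] for k in range(num_bins)]
    bins = [b for b in bins if b]  # guard against empty bins on tiny datasets
    for step in range(num_steps):
        progress = step / max(1, num_steps - 1)  # 0 at the start, 1 at the end
        weights = [(1.0 - progress) * (len(bins) - k) + progress for k in range(len(bins))]
        k = rng.choices(range(len(bins)), weights=weights, k=1)[0]
        yield examples[rng.choice(bins[k])]


# Toy usage: sentence length as a difficulty proxy.
sentences = ["ein Haus", "der Mann geht", "die Frau liest ein sehr langes Buch im Garten"]
difficulty = [len(s.split()) for s in sentences]
for src in curriculum_sampler(sentences, difficulty, num_steps=5):
    print(src)
```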
Abstract:To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component's contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed. We also find that continued training does not move the model very far from the out-of-domain model, as measured by a sensitivity analysis metric, suggesting that the out-of-domain model can provide a good generic initialization for the new domain.
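In a typical PyTorch NMT implementation, freezing a single component during continued training amounts to toggling `requires_grad` on its parameters; the sketch below is illustrative (the component names and toy model are assumptions that depend on the specific NMT codebase), not the paper's experimental setup.

```python
import torch
import torch.nn as nn


def set_component_trainable(model: nn.Module, prefix: str, trainable: bool) -> int:
    """Freeze or unfreeze every parameter whose name starts with `prefix`
    (e.g. 'encoder', 'decoder', 'src_embed'); returns how many tensors were touched."""
    count = 0
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = trainable
            count += 1
    return count


# Example: continue training only the decoder on in-domain data (toy model for shape only).
model = nn.ModuleDict({
    "src_embed": nn.Embedding(1000, 64),
    "encoder": nn.GRU(64, 64, batch_first=True),
    "decoder": nn.GRU(64, 64, batch_first=True),
    "tgt_embed": nn.Embedding(1000, 64),
})
for prefix in ("src_embed", "encoder", "tgt_embed"):
    set_component_trainable(model, prefix, trainable=False)
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```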