Abstract: Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models suffer from slow inference and high computational overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.
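The abstract mentions distilling knowledge from a teacher model into a lightweight, low-dimensional student encoder. Below is a minimal, hypothetical sketch of one plausible form of embedding-level distillation: the student's low-dimensional sentence embedding is projected up to the teacher's dimension and pulled toward the teacher's embedding with an MSE loss. The dimensions, the linear projection, and the loss form are illustrative assumptions, not the paper's exact objectives.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize embeddings to unit length so the loss compares directions.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def feature_distillation_loss(student_emb, teacher_emb, proj):
    """MSE between projected, normalized student embeddings and
    normalized teacher embeddings (one plausible distillation loss)."""
    projected = l2_normalize(student_emb @ proj)   # (batch, d_teacher)
    target = l2_normalize(teacher_emb)             # (batch, d_teacher)
    return np.mean(np.sum((projected - target) ** 2, axis=-1))

# Toy usage: 4 sentences, 128-dim student vs. 768-dim teacher (assumed sizes).
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 128))
teacher = rng.normal(size=(4, 768))
proj = rng.normal(size=(128, 768)) * 0.02          # learnable in practice
print(feature_distillation_loss(student, teacher, proj))
```

In practice such a term would be combined with the usual translation-ranking objective used to train language-agnostic embeddings; here it is shown in isolation for clarity.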
Abstract: Measuring the domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising concerns a different aspect of data quality and aims to reduce the negative impact of data noise on MT training, in particular neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations show that the approach significantly improves NMT models trained on data with severe noise.
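The abstract describes a denoising curriculum realized by online data selection with trusted data. Below is a simplified, hypothetical sketch of that idea: each candidate sentence pair is scored by how much more likely it is under a model exposed to trusted data than under a model trained on the noisy data, and at each training step only the highest-scoring fraction of the candidate batch is kept, with the fraction tightened over time. The scoring function, the two scoring models, and the linear schedule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def noise_score(logprob_trusted, logprob_noisy):
    """Per-example denoising score: how much more likely the pair is under a
    trusted-data model than under a noisy-data model (higher = cleaner).
    Both inputs are assumed to be length-normalized log-probabilities."""
    return logprob_trusted - logprob_noisy

def select_batch(scores, keep_fraction):
    """Online selection: keep indices of the highest-scoring fraction."""
    k = max(1, int(len(scores) * keep_fraction))
    return np.argsort(scores)[::-1][:k]

def curriculum_fraction(step, total_steps, start=1.0, end=0.3):
    """Denoising curriculum: anneal the kept fraction from `start` to `end`."""
    t = min(step / total_steps, 1.0)
    return start + t * (end - start)

# Toy usage: 8 candidate pairs scored by two (hypothetical) scoring models.
rng = np.random.default_rng(1)
lp_trusted = rng.normal(-2.0, 0.5, size=8)
lp_noisy = rng.normal(-2.2, 0.5, size=8)
scores = noise_score(lp_trusted, lp_noisy)
keep = curriculum_fraction(step=5000, total_steps=10000)
print(select_batch(scores, keep))
```

Starting permissive and gradually restricting training to the cleanest examples lets the model benefit from data volume early while being fine-tuned on low-noise pairs late in training.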