Title: xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Jun 22, 2023

Mingda Chen, Kevin Heffernan, Onur Çelebi, Alex Mourachko, Holger Schwenk

Abstract: We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples that more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
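
The idea in the abstract can be illustrated with a minimal sketch. It assumes the proxy is an error rate: the fraction of source sentences whose nearest neighbour in the embedding space is not the gold translation. Plain cosine similarity is used here for simplicity, and the released tooling may score pairs differently (for example with a margin-based criterion). The function name `xsim_error_rate` and the toy 2-D vectors are hypothetical and only show why adding hard-to-distinguish English candidates makes the test stricter.

```python
import numpy as np

def xsim_error_rate(src_emb, tgt_emb, gold):
    """Fraction of source sentences whose nearest target (by cosine
    similarity) is not the gold-aligned target. Hypothetical helper."""
    # L2-normalise rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T              # shape: (n_src, n_tgt)
    pred = sims.argmax(axis=1)      # index of the nearest target per source
    return float(np.mean(pred != np.asarray(gold)))

# Toy embeddings (assumed values): two source sentences, two gold English targets.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.9, 0.1], [0.1, 0.9]])
gold = [0, 1]
base_error = xsim_error_rate(src, tgt, gold)

# xSIM++-style setting: append a synthetic English candidate that lies very
# close to the first gold target. Gold indices are unchanged because the
# extra candidate is added at the end of the target pool.
hard_negatives = np.array([[1.0, 0.05]])
tgt_aug = np.vstack([tgt, hard_negatives])
aug_error = xsim_error_rate(src, tgt_aug, gold)

print(base_error, aug_error)
```

In this toy case the original pool is trivially separable, while the augmented pool exposes an error for the first sentence, which is the kind of sensitivity the abstract attributes to xSIM++.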

* The first two authors contributed equally; ACL 2023 short; Code and data are available at https://github.com/facebookresearch/LASER
