Title: Data Selection with Cluster-Based Language Difference Models and Cynical Selection

Apr 09, 2019

Lucía Santamaría, Amittai Axelrod

Abstract: We present and apply two methods for addressing the problem of selecting relevant training data out of a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models, we first introduce a cluster-based method that uses Brown clusters to condense the vocabulary of the corpora. Second, we implement the cynical data selection method, which incrementally constructs a training corpus to efficiently model the task corpus. Both the cluster-based and the cynical data selection approaches are used for the first time within a machine translation system, and we perform a head-to-head comparison. Our intrinsic evaluations show that both new methods outperform the standard Moore-Lewis approach (cross-entropy difference), yielding better perplexity and OOV rates on in-domain data. The cynical approach converges much more quickly, covering nearly all of the in-domain vocabulary with 84% less data than the other methods. Furthermore, the new approaches can be used to select machine translation training data for training better systems. Our results confirm that class-based selection using Brown clusters is a viable alternative to POS-based class-based methods, and removes the reliance on a part-of-speech tagger. Additionally, we are able to validate the recently proposed cynical data selection method, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and more closely matches the sentence length of the task corpus.
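For readers unfamiliar with the Moore-Lewis baseline named in the abstract, the following is a minimal sketch of cross-entropy difference scoring. It is not the authors' implementation: the abstract does not specify the language models used, so the add-one smoothed unigram models, the `<unk>` token, and the function names `train_unigram`, `cross_entropy`, and `moore_lewis_rank` are all assumptions chosen for brevity. Each pool sentence is scored by its per-word cross-entropy under an in-domain model minus its cross-entropy under a general-domain model, and lower scores are treated as more relevant.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram log-probabilities over a fixed vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def cross_entropy(sentence, logprob):
    """Per-word cross-entropy (in nats) of a sentence under a unigram model."""
    toks = sentence.split()
    if not toks:
        return float("inf")
    return -sum(logprob.get(t, logprob[UNK]) for t in toks) / len(toks)


def moore_lewis_rank(pool, in_domain, general_sample):
    """Sort pool sentences by H_in(s) - H_gen(s), most in-domain-like first."""
    # Shared vocabulary so both models assign mass to the same events.
    vocab = {t for s in in_domain + general_sample for t in s.split()} | {UNK}
    lm_in = train_unigram(in_domain, vocab)
    lm_gen = train_unigram(general_sample, vocab)
    scored = [(cross_entropy(s, lm_in) - cross_entropy(s, lm_gen), s) for s in pool]
    return [s for _, s in sorted(scored)]


if __name__ == "__main__":
    in_domain = ["the patient received the drug", "the drug dose was low"]
    general = ["the match ended in a draw", "the team won the match"]
    pool = ["the team lost the match", "the drug was effective"]
    for sentence in moore_lewis_rank(pool, in_domain, general):
        print(sentence)
```

In practice one would keep only the top-ranked fraction of the pool. Under the same assumptions, a cluster-based variant in the spirit of the abstract would map each token to a cluster identifier before training and scoring.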

* Proceedings of the International Workshop on Spoken Language Translation (IWSLT) 2017
* 9 pages, 7 figures, IWSLT 2017
