Elizaveta Korotkova

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Sep 06, 2024

Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

Abstract: Language models can benefit greatly from efficient tokenization. However, they still mostly use the classical BPE algorithm, a simple and reliable method. BPE has been shown to cause issues such as under-trained tokens and sub-optimal compression, which may affect downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that our method does not reduce downstream performance, and in several cases improves it.

* 9 pages

Beyond Toxic: Toxicity Detection Datasets are Not Enough for Brand Safety

Mar 27, 2023

Elizaveta Korotkova, Isaac Kwan Yin Chung

Abstract: The rapid growth in user-generated content on social media has resulted in a significant rise in demand for automated content moderation. Various methods and frameworks have been proposed for the tasks of hate speech detection and toxic comment classification. In this work, we combine common datasets to extend these tasks to brand safety. Brand safety aims to protect commercial branding by identifying contexts where advertisements should not appear, and it covers not only toxicity but also other potentially harmful content. As these datasets contain different label sets, we approach the overall problem as a binary classification task. We demonstrate the need for building brand-safety-specific datasets by applying common toxicity detection datasets to a subset of brand safety, and we empirically analyze the effects of weighted sampling strategies in text classification.

Translation Transformers Rediscover Inherent Data Domains

Sep 16, 2021

Maksym Del, Elizaveta Korotkova, Mark Fishel

Abstract: Many works have proposed methods to improve the performance of Neural Machine Translation (NMT) models in a domain/multi-domain adaptation scenario. However, an understanding of how NMT baselines represent text domain information internally is still lacking. Here we analyze the sentence representations learned by NMT Transformers and show that these explicitly include information on text domains, even after only seeing the input sentences without domain labels. Furthermore, we show that this internal information is enough to cluster sentences by their underlying domains without supervision. We show that NMT models produce clusters better aligned to the actual domains compared to pre-trained language models (LMs). Notably, when computed at the document level, NMT cluster-to-domain correspondence nears 100%. We use these findings together with an approach to NMT domain adaptation using automatically extracted domains. Whereas previous work relied on external LMs for text clustering, we propose re-using the NMT model as a source of unsupervised clusters. We perform an extensive experimental study comparing two approaches across two data scenarios, three language pairs, and both sentence-level and document-level clustering, showing equal or significantly superior performance compared to LMs.

* Accepted at WMT21; 15 pages, 7 figures

Grammatical Error Correction and Style Transfer via Zero-shot Monolingual Translation

Mar 27, 2019

Elizaveta Korotkova, Agnes Luhtaru, Maksym Del, Krista Liin, Daiga Deksne, Mark Fishel

Abstract: Both grammatical error correction and text style transfer can be viewed as monolingual sequence-to-sequence transformation tasks, but the scarcity of directly annotated data for either task makes them infeasible for most languages. We present an approach that performs both tasks within the same trained model and only uses regular language parallel data, without requiring error-corrected or style-adapted texts. We apply our model to three languages and present a thorough evaluation on both tasks, showing that the model is reliable for a number of error types and style transfer aspects.

Monolingual and Cross-lingual Zero-shot Style Transfer

Aug 01, 2018

Elizaveta Korotkova, Maksym Del, Mark Fishel

Abstract: We introduce the task of zero-shot style transfer between different languages. Our training data includes multilingual parallel corpora but, as in recent previous work, does not contain any parallel sentences between styles. We propose a unified multilingual multi-style machine translation system design that allows zero-shot style conversions during inference; moreover, it does so both monolingually and cross-lingually. Our model makes it possible to increase the presence of dissimilar styles in a corpus by up to 3 times, easily learns to operate with various contractions, and provides reasonable lexicon swaps, as we see from manual evaluation.
