Dana Ruiter

StereoKG: Data-Driven Knowledge Graph Construction for Cultural Knowledge and Stereotypes

May 27, 2022

Awantee Deshpande, Dana Ruiter, Marius Mosbach, Dietrich Klakow

Abstract: Analyzing ethnic or religious bias is important for improving fairness, accountability, and transparency of natural language processing models. However, many techniques rely on human-compiled lists of bias terms, which are expensive to create and are limited in coverage. In this study, we present a fully data-driven pipeline for generating a knowledge graph (KG) of cultural knowledge and stereotypes. Our resulting KG covers 5 religious groups and 5 nationalities and can easily be extended to include more entities. Our human evaluation shows that the majority (59.2%) of non-singleton entries are coherent and complete stereotypes. We further show that performing intermediate masked language model training on the verbalized KG leads to a higher level of cultural awareness in the model and has the potential to increase classification performance on knowledge-crucial samples on a related task, i.e., hate speech detection.

* 12 pages, 2 figures, accepted as a long paper at WOAH at NAACL 2022

Exploiting Social Media Content for Self-Supervised Style Transfer

May 18, 2022

Dana Ruiter, Thomas Kleinbauer, Cristina España-Bonet, Josef van Genabith, Dietrich Klakow

Abstract: Recent research on style transfer takes inspiration from unsupervised neural machine translation (UNMT), learning from large amounts of non-parallel data by exploiting cycle consistency loss, back-translation, and denoising autoencoders. By contrast, the use of self-supervised NMT (SSNMT), which leverages (near) parallel instances hidden in non-parallel data more efficiently than UNMT, has not yet been explored for style transfer. In this paper, we present a novel Self-Supervised Style Transfer (3ST) model, which augments SSNMT with UNMT methods in order to identify and efficiently exploit supervisory signals in non-parallel social media posts. We compare 3ST with state-of-the-art (SOTA) style transfer models across civil rephrasing, formality and polarity tasks. We show that 3ST is able to balance the three major objectives (fluency, content preservation, attribute transfer accuracy) the best, outperforming SOTA models on averaged performance across their tested tasks in automatic and human evaluation.

* 13 pages, 2 figures, accepted as a long paper at SocialNLP 2022 (@NAACL)

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

May 04, 2022

David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang (+35 more)

Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.

* Accepted to NAACL 2022

Placing M-Phasis on the Plurality of Hate: A Feature-Based Corpus of Hate Online

Apr 28, 2022

Dana Ruiter, Liane Reiners, Ashwin Geet D'Sa, Thomas Kleinbauer, Dominique Fohr, Irina Illina, Dietrich Klakow, Christian Schemer, Angeliki Monnier

Abstract: Even though hate speech (HS) online has been an important object of research in the last decade, most HS-related corpora over-simplify the phenomenon of hate by attempting to label user comments as "hate" or "neutral". This ignores the complex and subjective nature of HS, which limits the real-life applicability of classifiers trained on these corpora. In this study, we present the M-Phasis corpus, a corpus of ~9k German and French user comments collected from migration-related news articles. It goes beyond the "hate"-"neutral" dichotomy and is instead annotated with 23 features, which in combination become descriptors of various types of speech, ranging from critical comments to implicit and explicit expressions of hate. The annotations are performed by 4 native speakers per language and achieve high (0.77 ≤ κ ≤ 1) inter-annotator agreements. Besides describing the corpus creation and presenting insights from a content, error and domain analysis, we explore its data characteristics by training several classification baselines.

* 14 pages, 4 figures, accepted at LREC 2022 (Full Paper)
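
The agreement range quoted in the M-Phasis abstract is a kappa statistic. The abstract does not say which variant was used, so the following is only a minimal sketch that assumes pairwise Cohen's kappa between two annotators. The function name `cohen_kappa` and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: pairwise Cohen's kappa; the corpus itself may have been
    evaluated with a different agreement coefficient.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy binary feature annotated by two hypothetical annotators.
annotator_1 = [1, 1, 1, 0]
annotator_2 = [1, 1, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```

With 4 annotators per language, a pairwise coefficient like this one would have to be computed for each annotator pair and each of the 23 features, or replaced by a multi-rater coefficient.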

EdinSaar@WMT21: North-Germanic Low-Resource Multilingual NMT

Sep 29, 2021

Svetlana Tchistiakova, Jesujoba Alabi, Koel Dutta Chowdhury, Sourav Dutta, Dana Ruiter

Abstract: We describe the EdinSaar submission to the shared task of Multilingual Low-Resource Translation for North Germanic Languages at the Sixth Conference on Machine Translation (WMT2021). We submit multilingual translation models for translations to/from Icelandic (is), Norwegian Bokmål (nb), and Swedish (sv). We employ various experimental approaches, including multilingual pre-training, back-translation, fine-tuning, and ensembling. In most translation directions, our models outperform other submitted systems.

* To be published at WMT2021

Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

Jul 19, 2021

Dana Ruiter, Dietrich Klakow, Josef van Genabith, Cristina España-Bonet

Abstract: For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that integrating UMT techniques into SSNMT significantly outperforms SSNMT and UMT on all tested language pairs, with improvements of up to +4.3 BLEU, +50.8 BLEU, and +51.5 BLEU over SSNMT, statistical UMT, and hybrid UMT, respectively, on Afrikaans to English. We further show that the combination of multilingual denoising autoencoding, SSNMT with back-translation, and bilingual fine-tuning enables us to learn machine translation even for distant language pairs for which only small amounts of monolingual data are available, e.g. yielding BLEU scores of 11.6 (English to Swahili).

* 11 pages, 8 figures, accepted at MT-Summit 2021 (Research Track)
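
The abstract above names noising as one of the UMT data generation techniques. As a rough illustration only, the sketch below assumes the kind of noise commonly used for denoising autoencoding (random word dropout plus a bounded local shuffle). The function `add_noise`, its parameters, and their default values are hypothetical and are not taken from the paper.

```python
import random


def add_noise(tokens, drop_prob=0.1, max_shift=3, rng=None):
    """Return a noised copy of `tokens` (word dropout + local shuffle).

    Assumption: this mirrors a common denoising-autoencoder noise model,
    not necessarily the exact configuration used by the authors.
    """
    rng = rng or random.Random()
    # Word dropout: keep each token with probability 1 - drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        # Never return an empty sentence for a non-empty input.
        kept = [rng.choice(tokens)]
    # Local shuffle: add bounded random offsets to the positions and
    # re-sort, so tokens move only a few places from where they started.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


sentence = "parallel data is either scarce or simply unavailable".split()
noised = add_noise(sentence, rng=random.Random(0))
print(noised)
```

A denoising autoencoder would then be trained to reconstruct `sentence` from `noised`. Passing a seeded `random.Random` makes the noise reproducible.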

Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces

Jun 18, 2021

Vanessa Hahn, Dana Ruiter, Thomas Kleinbauer, Dietrich Klakow

Abstract: Hate speech and profanity detection suffer from data sparsity, especially for languages other than English, due to the subjective nature of the tasks and the resulting annotation incompatibility of existing corpora. In this study, we identify profane subspaces in word and sentence representations and explore their generalization capability on a variety of similar and distant target tasks in a zero-shot setting. This is done monolingually (German) and cross-lingually to closely-related (English), distantly-related (French) and non-related (Arabic) tasks. We observe that, on both similar and distant target tasks and across all languages, the subspace-based representations transfer more effectively than standard BERT representations in the zero-shot setting, with improvements between F1 +10.9 and F1 +42.9 over the baselines across all tested monolingual and cross-lingual scenarios.

* 9 pages, 4 figures, accepted as a long paper at Workshop on Online Abuse and Harms 2021

MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation

Mar 15, 2021

David I. Adelani, Dana Ruiter, Jesujoba O. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, Cristina España-Bonet

Abstract: Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá-English (yo-en) language pair with standardized train-test splits for benchmarking. We provide several neural MT (NMT) benchmarks on this dataset and compare to the performance of popular pre-trained (massively multilingual) MT models, showing that, in almost all cases, our simple benchmarks outperform the pre-trained MT models. A major gain of BLEU +9.9 and +8.6 (en2yo) is achieved in comparison to Facebook's M2M-100 and Google multilingual NMT respectively when we use MENYO-20k to fine-tune generic models.

* Accepted to the AfricaNLP 2021 Workshop @EACL 2021

Emoji-Based Transfer Learning for Sentiment Tasks

Feb 12, 2021

Susann Boy, Dana Ruiter, Dietrich Klakow

Abstract: Sentiment tasks such as hate speech detection and sentiment analysis, especially when performed on languages other than English, are often low-resource. In this study, we exploit the emotional information encoded in emojis to enhance the performance on a variety of sentiment tasks. This is done using a transfer learning approach, where the parameters learned by an emoji-based source task are transferred to a sentiment target task. We analyse the efficacy of the transfer under three conditions, i.e. i) the emoji content and ii) label distribution of the target task as well as iii) the difference between monolingually and multilingually learned source tasks. We find, among other things, that the transfer is most beneficial if the target task is balanced with high emoji content. Monolingually learned source tasks have the benefit of taking into account the culturally specific use of emojis and gain up to F1 +0.280 over the baseline.

* 6 pages, 2 figures, accepted at EACL-SRW 2021

HUMAN: Hierarchical Universal Modular ANnotator

Oct 02, 2020

Moritz Wolf, Dana Ruiter, Ashwin Geet D'Sa, Liane Reiners, Jan Alexandersson, Dietrich Klakow

Abstract: Many real-world phenomena are complex and cannot be captured by single-task annotations. This creates a need for subsequent annotations, with interdependent questions and answers describing the nature of the subject at hand. Even when a phenomenon is easily captured by a single task, the high specialisation of most annotation tools can force a switch to another tool if the task changes only slightly. We introduce HUMAN, a novel web-based annotation tool that addresses these problems by a) covering a variety of annotation tasks on both textual and image data, and b) using an internal deterministic state machine, which allows the researcher to chain different annotation tasks in an interdependent manner. Further, the modular nature of the tool makes it easy to define new annotation tasks and integrate machine learning algorithms, e.g., for active learning. HUMAN comes with an easy-to-use graphical user interface that simplifies the annotation task and management.

* 7 pages, 4 figures, EMNLP - Demonstrations 2020
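
The idea of chaining interdependent annotation tasks through a deterministic state machine can be illustrated with a small sketch. It is not HUMAN's actual implementation or configuration format: the task names, answers, `TRANSITIONS` table, and `annotate` function below are all hypothetical.

```python
# Hypothetical annotation protocol: each (task, answer) pair maps to
# exactly one next task, which is what makes the machine deterministic.
TRANSITIONS = {
    ("is_relevant", "yes"): "has_target",
    ("is_relevant", "no"): "done",
    ("has_target", "yes"): "label_target",
    ("has_target", "no"): "done",
    ("label_target", "group"): "done",
    ("label_target", "individual"): "done",
}


def annotate(answers, start="is_relevant", end="done"):
    """Walk the state machine, consuming one answer per visited task.

    Returns the (task, answer) pairs that were visited. Raises KeyError
    for an answer with no defined transition and ValueError if the
    answers run out before the end state is reached.
    """
    state = start
    path = []
    for answer in answers:
        if state == end:
            break
        path.append((state, answer))
        state = TRANSITIONS[(state, answer)]
    if state != end:
        raise ValueError("ran out of answers before reaching the end state")
    return path


# A comment judged relevant, with a target that is a group.
print(annotate(["yes", "yes", "group"]))
# A comment judged irrelevant skips the follow-up questions entirely.
print(annotate(["no"]))
```

Because later questions are reached only through earlier answers, the annotator is asked follow-up questions only when they apply.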
