
Wondimagegnhue Tsegaye Tufa

Grounding Toxicity in Real-World Events across Languages

May 22, 2024

Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Abstract: Social media conversations frequently suffer from toxicity, creating significant issues for users, moderators, and entire communities. Events in the real world, like elections or conflicts, can initiate and escalate toxic behavior online. Our study investigates how real-world events influence the origin and spread of toxicity in online discussions across various languages and regions. We gathered Reddit data comprising 4.5 million comments from 31 thousand posts in six different languages (Dutch, English, German, Arabic, Turkish and Spanish). We target fifteen major social and political world events that occurred between 2020 and 2023. We observe significant variations in toxicity, negative sentiment, and emotion expressions across different events and language communities, showing that toxicity is a complex phenomenon in which many different factors interact and still need to be investigated. We will release the data for further research along with our code.

* Paper accepted at the 29th International Conference on Natural Language & Information Systems (NLDB 2024)


Unknown Script: Impact of Script on Cross-Lingual Transfer

Apr 29, 2024

Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Abstract: Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often-overlooked aspect in this domain: the influence of the source language of the base language model on transfer performance. We conduct a series of experiments to determine the effect of the script and tokenizer used in the pre-trained model on the performance of the downstream task. Our findings reveal the importance of the tokenizer as a stronger factor than the sharing of the script, the language typology match, and the model size.

* Paper accepted to NAACL Student Research Workshop (SRW) 2024


The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages

Apr 29, 2024

Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Abstract: Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages: English, German, Spanish, Turkish, Arabic, and Dutch, covering 80 topics such as Culture, Politics, and News. We thoroughly analyze how toxicity spikes within different communities in relation to specific topics. We observe consistent patterns of increased toxicity across languages for certain topics, while also noting significant variations within specific language communities.

* Accepted to TRAC 2024
