Sushant Chatufale

LAHM : Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification

Apr 03, 2023

Ankit Yadav, Shubham Chandel, Sushant Chatufale, Anil Bandhakavi

Figures 1-4 for LAHM : Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification

Abstract: Current research on hate speech analysis is typically oriented towards monolingual settings and single classification tasks. In this paper, we present a new multilingual hate speech analysis dataset covering six languages (English, Hindi, Arabic, French, German and Spanish) and five hate speech domains: Abuse, Racism, Sexism, Religious Hate and Extremism. To the best of our knowledge, this paper is the first to address the problem of identifying various types of hate speech in these five broad domains across these six languages. In this work, we describe how we created the dataset, how we created high-level and low-level annotations for the different domains, and how we use the dataset to test current state-of-the-art multilingual and multitask learning approaches. We evaluate our dataset in various monolingual, cross-lingual and machine translation classification settings and compare it against open source English datasets that we aggregated and merged for this task. Finally, we discuss how this approach can be used to create large-scale hate speech datasets and how our annotations can be leveraged to improve hate speech detection and classification in general.
