Abstract:Hate speech is a growing problem on social media. It can seriously impact society, especially in countries like Ethiopia, where it can trigger conflicts among diverse ethnic and religious groups. While hate speech detection in resource rich languages are progressing, for low resource languages such as Amharic are lacking. To address this gap, we develop Amharic hate speech data and SBi-LSTM deep learning model that can detect and classify text into four categories of hate speech: racial, religious, gender, and non-hate speech. We have annotated 5k Amharic social media post and comment data into four categories. The data is annotated using a custom annotation tool by a total of 100 native Amharic speakers. The model achieves a 94.8 F1-score performance. Future improvements will include expanding the dataset and develop state-of-the art models. Keywords: Amharic hate speech detection, classification, Amharic dataset, Deep Learning, SBi-LSTM
Abstract:Question Answering (QA) returns concise answers or answer lists from natural language text given a context document. Many resources go into curating QA datasets to advance robust models' development. There is a surge of QA datasets for languages like English, however, this is not true for Amharic. Amharic, the official language of Ethiopia, is the second most spoken Semitic language in the world. There is no published or publicly available Amharic QA dataset. Hence, to foster the research in Amharic QA, we present the first Amharic QA (AmQA) dataset. We crowdsourced 2628 question-answer pairs over 378 Wikipedia articles. Additionally, we run an XLMR Large-based baseline model to spark open-domain QA research interest. The best-performing baseline achieves an F-score of 69.58 and 71.74 in reader-retriever QA and reading comprehension settings respectively.