Abstract: The struggle of social media platforms to moderate content in a timely manner encourages users to abuse such platforms to spread vulgar or abusive language, which, when performed repeatedly, becomes cyberbullying: a social problem taking place in virtual environments, yet with real-world consequences, such as depression, withdrawal, or even suicide attempts of its victims. Systems for the automatic detection and mitigation of cyberbullying have been developed, but, unfortunately, the vast majority of them are for the English language, with only a handful available for low-resource languages. To estimate the present state of research and recognize the needs for further development, in this paper we present a comprehensive systematic survey of studies done so far on automatic cyberbullying detection in low-resource languages. We analyzed all available studies on this topic: more than seventy studies on the automatic detection of cyberbullying or related language in low-resource languages and dialects, published between 2017 and January 2023. The survey covers 23 low-resource languages and dialects, including Bangla, Hindi, the Dravidian languages, and others. We identify several research gaps in previous studies, including the lack of reliable definitions of cyberbullying and its relevant subcategories, and biases in the acquisition and annotation of data. Based on the recognized research gaps, we provide suggestions for improving the general conduct of research on cyberbullying detection, with a primary focus on low-resource languages. Following those suggestions, we collect and release a cyberbullying dataset in the Chittagonian dialect of Bangla and propose a number of initial ML solutions trained on that dataset. In addition, we also experimented with the pre-trained transformer-based BanglaBERT model.
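For illustration, a minimal sketch of the transformer baseline mentioned above: fine-tuning BanglaBERT for binary cyberbullying classification with the Hugging Face transformers library. The model identifier "csebuetnlp/banglabert", the toy data, and all hyperparameters are assumptions, not the exact setup used in the paper; the released Chittagonian dataset would replace the placeholder examples.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=2)  # 0 = not bullying, 1 = bullying

# Placeholder examples; the released Chittagonian dataset would be loaded here.
data = Dataset.from_dict({"text": ["example post 1", "example post 2"],
                          "label": [0, 1]})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     padding="max_length", max_length=128),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="banglabert-cyberbullying",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```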
Abstract: This paper investigates the impact of data volume and the use of similar languages on transfer learning in a machine translation task. We find that having more data generally leads to better performance, as it allows the model to learn more patterns and generalizations from the data. However, transfer from related languages can be particularly effective when there is limited data available for a specific language pair, as the model can leverage the similarities between the languages to improve performance. To demonstrate this, we fine-tune the mBART model for a Polish-English translation task using the OPUS-100 dataset. We evaluate the performance of the model under various transfer learning configurations, including different transfer source languages and different shot levels for Polish, and report the results. Our experiments show that a combination of related languages and larger amounts of data outperforms models trained on related languages or larger amounts of data alone. Additionally, we show the importance of related languages in zero-shot and few-shot configurations.
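A minimal sketch of the fine-tuning setup described above, using the Hugging Face transformers and datasets libraries. The specific checkpoint ("facebook/mbart-large-50") and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, MBart50TokenizerFast,
                          MBartForConditionalGeneration, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="pl_PL", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

raw = load_dataset("opus100", "en-pl")  # parallel English-Polish sentence pairs

def preprocess(batch):
    src = [pair["pl"] for pair in batch["translation"]]
    tgt = [pair["en"] for pair in batch["translation"]]
    return tokenizer(src, text_target=tgt, truncation=True, max_length=128)

tokenized = raw.map(preprocess, batched=True, remove_columns=["translation"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mbart-pl-en",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3, learning_rate=3e-5),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Few-shot levels for Polish can be simulated by subsampling `tokenized["train"]` before training.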
Abstract: We study the selection of transfer languages for different Natural Language Processing tasks, specifically sentiment analysis, named entity recognition, and dependency parsing. To select an optimal transfer language, we propose utilizing linguistic similarity metrics to measure the distance between languages, and choosing the transfer language based on this information rather than relying on intuition. We demonstrate that linguistic similarity correlates with cross-lingual transfer performance for all of the proposed tasks. We also show a statistically significant difference between choosing the optimal language as the transfer source and defaulting to English. This allows us to select a more suitable transfer language that better leverages knowledge from high-resource languages to improve the performance of language applications lacking data. For the study, we used datasets covering eight languages from three language families.
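As a sketch of the selection idea: rank candidate high-resource source languages by their linguistic distance to the target and pick the closest, here using the lang2vec package (built on the URIEL database). The target language, candidate pool, and the "syntactic" metric are illustrative assumptions, not the paper's exact setup.

```python
import lang2vec.lang2vec as l2v

target = "fin"                                     # ISO 639-3 code of the target language
candidates = ["eng", "deu", "rus", "tur", "hun"]   # hypothetical high-resource pool

# Smaller distance = more similar; choose the nearest candidate as transfer source.
distances = {c: l2v.distance("syntactic", target, c) for c in candidates}
best = min(distances, key=distances.get)
print(distances, "->", best)
```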
Abstract: In most cases, word embeddings are learned only from raw tokens or, in some cases, lemmas. This includes pre-trained language models like BERT. To investigate the potential of capturing deeper relations between lexical items and structures, and to filter out redundant information, we propose preserving morphological, syntactic, and other types of linguistic information by combining them with the raw tokens or lemmas. This means, for example, including parts of speech or dependency information within the lexical features used. The word embeddings can then be trained on these combinations instead of just raw tokens. It is also possible to later apply this method to the pre-training of large language models and possibly enhance their performance. This would aid in tackling problems that are more sophisticated from the point of view of linguistic representation, such as the detection of cyberbullying.
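A minimal sketch of this idea: annotate the corpus with spaCy, replace each token with a token+tag combination, and train embeddings on those combinations with gensim. The "surface_POS" join format and the toy corpus are illustrative choices, not the exact scheme from the paper.

```python
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

texts = ["The cat sat on the mat.", "Dogs chase cats."]  # toy corpus

# Replace each token with a "surface_POS" combination, e.g. "cat_NOUN",
# so the embedding vocabulary preserves part-of-speech information.
sentences = [[f"{tok.text.lower()}_{tok.pos_}" for tok in doc if not tok.is_space]
             for doc in nlp.pipe(texts)]

model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1)
print(model.wv.most_similar("cat_NOUN"))  # embeddings are keyed by the combinations
```

The same pattern extends to dependency labels (`tok.dep_`) or morphological features in place of, or in addition to, the POS tag.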
Abstract: In this research, we analyze the potential of Feature Density (FD) as a way to comparatively estimate machine learning (ML) classifier performance prior to training. The goal of the study is to help solve the problem of resource-intensive training of ML models, which is becoming a serious issue due to continuously increasing dataset sizes and the ever-rising popularity of Deep Neural Networks (DNN). The constantly increasing demand for more powerful computational resources also affects the environment, as the training of large-scale ML models causes alarmingly growing amounts of CO2 emissions. Our approach is to optimize the resource-intensive training of ML models for Natural Language Processing so as to reduce the number of required experiment iterations. We expand on previous attempts at improving classifier training efficiency with FD, while also providing insight into the effectiveness of various linguistically-backed feature preprocessing methods for dialog classification, specifically cyberbullying detection.
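Feature Density is typically computed as the ratio of unique features to all feature occurrences in a dataset. A minimal sketch of that ratio over word unigrams, assuming scikit-learn's CountVectorizer for featurization (the unigram scheme is illustrative; the same ratio applies to any preprocessing method):

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_density(corpus, **vectorizer_kwargs):
    """Unique features divided by total feature occurrences in the corpus."""
    counts = CountVectorizer(**vectorizer_kwargs).fit_transform(corpus)
    unique_features = counts.shape[1]   # vocabulary size
    all_features = counts.sum()         # total number of feature occurrences
    return unique_features / all_features

corpus = ["you are stupid", "have a nice day", "nobody likes you"]
print(feature_density(corpus))
```

The studies then examine how this single number, computed before any training, correlates with eventual classifier performance.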
Abstract: In this research, we study the change in the performance of machine learning (ML) classifiers when various linguistic preprocessing methods are applied to a dataset, with a specific focus on linguistically-backed embeddings in Convolutional Neural Networks (CNN). Moreover, we study the concept of Feature Density and confirm its potential to comparatively predict the performance of ML classifiers, including CNNs. The research was conducted on a Formspring dataset provided in a Kaggle competition on automatic cyberbullying detection. The dataset was re-annotated by objective experts (psychologists), as the importance of professional annotation in cyberbullying research has been pointed out multiple times. The study confirmed the effectiveness of Neural Networks in cyberbullying detection and the correlation between classifier performance and Feature Density, while also proposing a new approach to training various linguistically-backed embeddings for Convolutional Neural Networks.
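A minimal PyTorch sketch of the kind of CNN classifier used in such studies: an embedding layer (which could be initialized with linguistically-backed embeddings) followed by 1-D convolutions and max-pooling. All hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=64,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        # The embedding weights could be replaced with pre-trained,
        # linguistically-backed embeddings (e.g. trained on token+POS combos).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

model = TextCNN(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (8, 50)))      # 8 posts, 50 tokens each
```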
Abstract: We study the selection of transfer languages for automatic abusive language detection. Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection. This way, we can use existing data from higher-resource languages to build better detection systems for low-resource languages. Our datasets cover seven languages from three language families. We measure the distance between the languages using several language similarity measures, in particular by quantifying the World Atlas of Language Structures (WALS). We show that there is a correlation between linguistic similarity and classifier performance. This discovery allows us to choose an optimal transfer language for zero-shot abusive language detection.
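A hedged sketch of quantifying WALS: represent each language as a vector of categorical WALS feature values, measure distance as the share of shared features on which two languages disagree, and correlate that distance with classifier performance. The feature values and F1 scores below are illustrative placeholders, not real WALS entries or the paper's results.

```python
from scipy.stats import pearsonr

# Language -> {WALS feature ID: categorical value}; placeholder values only.
wals = {
    "eng": {"81A": "SVO",   "30A": "none",  "51A": "no case"},
    "deu": {"81A": "mixed", "30A": "three", "51A": "case"},
    "fin": {"81A": "SVO",   "30A": "none",  "51A": "case"},
    "tur": {"81A": "SOV",   "30A": "none",  "51A": "case"},
}

def wals_distance(a, b):
    """Share of shared WALS features on which the two languages disagree."""
    shared = set(wals[a]) & set(wals[b])
    return sum(wals[a][f] != wals[b][f] for f in shared) / len(shared)

# Correlate distance from the transfer source (here English) with the
# hypothetical zero-shot F1 of a classifier trained on English data.
targets = ["deu", "fin", "tur"]
distances = [wals_distance("eng", t) for t in targets]
f1_scores = [0.55, 0.72, 0.61]  # hypothetical results, for illustration only
print(pearsonr(distances, f1_scores))
```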
Abstract: We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesise that estimating dataset complexity allows for a reduction in the number of required experiment iterations. This way, we can optimize the resource-intensive training of ML models, which is becoming a serious issue due to increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The problem of constantly increasing demand for more powerful computational resources also affects the environment, due to the alarmingly growing amount of CO2 emissions caused by the training of large-scale ML models. The research was conducted on multiple datasets, including popular ones such as the Yelp business review dataset used for training typical sentiment analysis models, as well as more recent datasets tackling the problem of cyberbullying, which, being a serious social problem, is also a much more sophisticated problem from the point of view of linguistic representation. We use cyberbullying datasets collected for multiple languages, namely English, Japanese, and Polish. The difference in the linguistic complexity of the datasets also allows us to discuss the efficacy of linguistically-backed word preprocessing.
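A sketch of the comparison above: the same corpus yields a different FD value depending on how features are extracted. The three variants shown (surface tokens, lemmas, lemma+POS combinations) and the use of spaCy for annotation are illustrative assumptions, not the paper's full set of preprocessing methods.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
corpus = ["You are so stupid", "You were always stupid", "Have a great day"]

def fd(feature_lists):
    """Unique features divided by total feature occurrences."""
    all_feats = [f for feats in feature_lists for f in feats]
    return len(set(all_feats)) / len(all_feats)

docs = list(nlp.pipe(corpus))
variants = {
    "tokens":    [[t.lower_ for t in d] for d in docs],
    "lemmas":    [[t.lemma_ for t in d] for d in docs],
    "lemma+POS": [[f"{t.lemma_}_{t.pos_}" for t in d] for d in docs],
}
for name, feats in variants.items():
    print(name, round(fd(feats), 3))
```

Lemmatization typically merges inflected forms ("are"/"were" -> "be") and so lowers FD, while appending tags splits forms apart and raises it; comparing these values against classifier results is what the study examines.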