Aleksander Smywiński-Pohl

eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey

Jun 29, 2024

Krzysztof Nowak, Jędrzej Ziębura, Krzysztof Wróbel, Aleksander Smywiński-Pohl

Abstract: This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models' performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.

Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density

Nov 03, 2021

Juuso Eronen, Michal Ptaszynski, Fumito Masui, Aleksander Smywiński-Pohl, Gniewosz Leliwa, Michal Wroczynski

Abstract: We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiment iterations. This way we can optimize the resource-intensive training of ML models, which is becoming a serious issue due to the increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The problem of constantly increasing needs for more powerful computational resources is also affecting the environment due to the alarmingly growing amount of CO2 emissions caused by training large-scale ML models. The research was conducted on multiple datasets, including popular datasets, such as the Yelp business review dataset used for training typical sentiment analysis models, as well as more recent datasets trying to tackle the problem of cyberbullying, which, being a serious social problem, is also a much more sophisticated problem from the point of view of linguistic representation. We use cyberbullying datasets collected for multiple languages, namely English, Japanese and Polish. The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.

* 73 pages, 4 figures, 19 tables, Information Processing and Management, Vol. 58, Issue 5, September 2021, paper ID: 102616
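
The abstract above does not define Feature Density. The following is a minimal sketch, assuming FD is the ratio of unique features to all feature occurrences in a corpus (analogous to lexical density) and that features are lowercased whitespace tokens. The function name `feature_density` and the toy corpus are illustrative and not taken from the paper.

```python
from collections import Counter


def feature_density(documents):
    """Return unique features divided by total feature occurrences.

    Assumption: a "feature" is a lowercased whitespace token. The paper
    applies several linguistically-backed preprocessing methods instead.
    """
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    total = sum(counts.values())
    # An empty corpus has no features, so report 0.0 rather than divide by zero.
    return len(counts) / total if total else 0.0


# Hypothetical two-document corpus used only for illustration.
corpus = ["the food was great", "the service was slow"]
fd = feature_density(corpus)
print(f"Feature Density: {fd:.2f}")
```

Under this assumed definition, a corpus that repeats the same tokens often has a low FD, while a corpus with many distinct tokens has a high FD. Swapping the tokenizer for lemmas or other preprocessing changes the score.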

Cyberbullying Detection -- Technical Report 2/2018, Department of Computer Science AGH, University of Science and Technology

Aug 02, 2018

Michał Ptaszyński, Gniewosz Leliwa, Mateusz Piech, Aleksander Smywiński-Pohl

Abstract: The research described in this paper concerns automatic cyberbullying detection in social media. There are two goals to achieve: building a gold standard cyberbullying detection dataset and measuring the performance of the Samurai cyberbullying detection system. The Formspring dataset provided in a Kaggle competition was re-annotated as a part of the research. The annotation procedure is described in detail and, unlike many other recent data annotation initiatives, does not use Mechanical Turk for finding people willing to perform the annotation. The new annotation compared to the old one seems to be more coherent since all tested cyberbullying detection systems performed better on the former. The performance of the Samurai system is compared with 5 commercial systems and one well-known machine learning algorithm, used for classifying textual content, namely Fasttext. It turns out that Samurai scores the best in all measures (accuracy, precision and recall), while Fasttext is the second-best performing algorithm.
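
The three measures named in the abstract can be computed for a binary classifier as in the sketch below. It uses the standard definitions and assumes label 1 marks cyberbullying. The function name `binary_metrics` and the example labels are hypothetical and do not come from the report.

```python
def binary_metrics(gold, predicted, positive=1):
    """Return (accuracy, precision, recall) for two equal-length label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    # Guard each ratio against an empty denominator.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up gold annotations and system output, for illustration only.
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = binary_metrics(gold, predicted)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```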

Improving text classification with vectors of reduced precision

Jun 20, 2017

Krzysztof Wróbel, Maciej Wielgosz, Marcin Pietroń, Michał Karwatowski, Aleksander Smywiński-Pohl

Abstract: This paper presents the analysis of the impact of a floating-point number precision reduction on the quality of text classification. The precision reduction of the vectors representing the data (e.g. TF-IDF representation in our case) allows for a decrease of computing time and memory footprint on dedicated hardware platforms. The impact of precision reduction on the classification quality was evaluated on 5 corpora, using 4 different classifiers. Also, dimensionality reduction was taken into account. Results indicate that the precision reduction improves classification accuracy in most cases (up to 25% of error reduction). In general, the reduction from 64 to 4 bits gives the best scores and ensures that the results will not be worse than with the full floating-point representation.
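
To make the idea of reduced-precision vectors concrete, here is a minimal sketch that rounds each vector component to one of 2^bits evenly spaced levels. This uniform quantization is a stand-in chosen for illustration. The abstract does not specify the reduction scheme, so it should not be read as the paper's method. The sketch also assumes the TF-IDF values are already normalized to the range [0, 1]. The name `quantize` is hypothetical.

```python
def quantize(values, bits):
    """Round values in [0, 1] to the nearest of 2**bits evenly spaced levels.

    Assumption: uniform quantization over a fixed range, used here only to
    illustrate storing each component in fewer bits.
    """
    steps = (1 << bits) - 1  # number of intervals between the levels
    return [round(v * steps) / steps for v in values]


# Hypothetical normalized TF-IDF vector.
tfidf = [0.0, 0.12, 0.37, 0.93, 1.0]
reduced = quantize(tfidf, bits=4)

# With 4 bits, every component takes one of 16 values and moves by at most
# half a step (1/30) from its original position.
for original, approx in zip(tfidf, reduced):
    print(f"{original:.2f} -> {approx:.4f}")
```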
