Danila Petrelli

BiaSWE: An Expert Annotated Dataset for Misogyny Detection in Swedish

Feb 11, 2025

Kätriin Kukk, Danila Petrelli, Judit Casademont, Eric J. W. Orlowski, Michał Dzieliński, Maria Jacobson

Abstract: In this study, we introduce the process for creating BiaSWE, an expert-annotated dataset tailored for misogyny detection in the Swedish language. To address the cultural and linguistic specificity of misogyny in Swedish, we collaborated with experts from the social sciences and humanities. Our interdisciplinary team developed a rigorous annotation process, incorporating both domain knowledge and language expertise, to capture the nuances of misogyny in a Swedish context. This methodology ensures that the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages. The dataset, along with the annotation guidelines, is publicly available for further research.

* To appear at NoDaLiDa 2025

SWEb: A Large Web Dataset for the Scandinavian Languages

Oct 06, 2024

Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul Dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren

Abstract: This paper presents the hitherto largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb), comprising over one trillion tokens. The paper details the collection and processing pipeline, and introduces a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches. We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results. All data, models and code are shared openly.

Text Annotation Handbook: A Practical Guide for Machine Learning Projects

Oct 18, 2023

Felix Stollenwerk, Joey Öhman, Danila Petrelli, Emma Wallerö, Fredrik Olsson, Camilla Bengtsson, Andreas Horndahl, Gabriela Zarzar Gandler

Abstract: This handbook is a hands-on guide on how to approach text annotation tasks. It provides a gentle introduction to the topic, an overview of theoretical concepts as well as practical advice. The topics covered are mostly technical, but business, ethical and regulatory issues are also touched upon. The focus lies on readability and conciseness rather than completeness and scientific rigor. Experience with annotation and knowledge of machine learning are useful but not required. The document may serve as a primer or reference book for a wide range of professions such as team leaders, project managers, IT architects, software developers and machine learning engineers.

* 30 pages, white paper
