Title: Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating Toxic Text Datasets

Dec 07, 2021

Kofi Arhin, Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, Moninder Singh

Abstract: The use of machine learning (ML)-based language models (LMs) to monitor content online is on the rise. For toxic text identification, task-specific fine-tuning of these models is performed using datasets labeled by annotators who provide ground-truth labels in an effort to distinguish between offensive and normal content. These projects have led to the development, improvement, and expansion of large datasets over time, and have contributed immensely to research on natural language. Despite these achievements, existing evidence suggests that ML models built on these datasets do not always result in desirable outcomes. Therefore, using a design science research (DSR) approach, this study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues and contributing to discussions on navigating these challenges for existing and future projects. To achieve the goal of the study, we re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help to improve dataset quality. While this approach may not improve the traditional metric of inter-annotator agreement, it may better capture dependence on context and diversity in annotators. We discuss the implications of these results for both theory and practice.

* 15 pages
