Jennifer A. Thompson

GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction

May 24, 2024

Virginia K. Felkner, Jennifer A. Thompson, Jonathan May

Abstract: Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.

* Accepted to ACL 2024 (main conference)
