Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Eve Fleisig

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Feb 27, 2025

Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, Jordan Lee Boyd-Graber

Abstract:Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.

Via

Access Paper or Ask Questions

Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree

Oct 16, 2024

Harbani Jaggi, Kashyap Murali, Eve Fleisig, Erdem Bıyık

Figure 1 for Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree

Figure 2 for Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree

Figure 3 for Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree

Figure 4 for Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree

Abstract:When annotators disagree, predicting the labels given by individual annotators can capture nuances overlooked by traditional label aggregation. We introduce three approaches to predicting individual annotator ratings on the toxicity of text by incorporating individual annotator-specific information: a neural collaborative filtering (NCF) approach, an in-context learning (ICL) approach, and an intermediate embedding-based architecture. We also study the utility of demographic information for rating prediction. NCF showed limited utility; however, integrating annotator history, demographics, and survey information permits both the embedding-based architecture and ICL to substantially improve prediction accuracy, with the embedding-based architecture outperforming the other methods. We also find that, if demographics are predicted from survey information, using these imputed demographics as features performs comparably to using true demographic data. This suggests that demographics may not provide substantial information for modeling ratings beyond what is captured in survey responses. Our findings raise considerations about the relative utility of different types of annotator information and provide new approaches for modeling annotators in subjective NLP tasks.

Via

Access Paper or Ask Questions

ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

Jun 24, 2024

Yoo Yeon Sung, Eve Fleisig, Ishani Mondal, Jordan Lee Boyd-Graber

Figure 1 for ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

Figure 2 for ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

Figure 3 for ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

Figure 4 for ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

Abstract:Adversarial benchmarks validate model abilities by providing samples that fool models but not humans. However, despite the proliferation of datasets that claim to be adversarial, there does not exist an established metric to evaluate how adversarial these datasets are. To address this lacuna, we introduce ADVSCORE, a metric which quantifies how adversarial and discriminative an adversarial dataset is and exposes the features that make data adversarial. We then use ADVSCORE to underpin a dataset creation pipeline that incentivizes writing a high-quality adversarial dataset. As a proof of concept, we use ADVSCORE to collect an adversarial question answering (QA) dataset, ADVQA, from our pipeline. The high-quality questions in ADVQA surpasses three adversarial benchmarks across domains at fooling several models but not humans. We validate our result based on difficulty estimates from 9,347 human responses on four datasets and predictions from three models. Moreover, ADVSCORE uncovers which adversarial tactics used by human writers fool models (e.g., GPT-4) but not humans. Through ADVSCORE and its analyses, we offer guidance on revealing language model vulnerabilities and producing reliable adversarial examples.

* arXiv admin note: substantial text overlap with arXiv:2401.11185

Via

Access Paper or Ask Questions

Standard Language Ideology in AI-Generated Language

Jun 13, 2024

Genevieve Smith, Eve Fleisig, Madeline Bossi, Ishita Rustagi, Xavier Yin

Figure 1 for Standard Language Ideology in AI-Generated Language

Abstract:In this position paper, we explore standard language ideology in language generated by large language models (LLMs). First, we outline how standard language ideology is reflected and reinforced in LLMs. We then present a taxonomy of open problems regarding standard language ideology in AI-generated language with implications for minoritized language communities. We introduce the concept of standard AI-generated language ideology, the process by which AI-generated language regards Standard American English (SAE) as a linguistic default and reinforces a linguistic bias that SAE is the most "appropriate" language. Finally, we discuss tensions that remain, including reflecting on what desirable system behavior looks like, as well as advantages and drawbacks of generative AI tools imitating--or often not--different English language varieties. Throughout, we discuss standard language ideology as a manifestation of existing global power structures in and through AI-generated language before ending with questions to move towards alternative, more emancipatory digital futures.

Via

Access Paper or Ask Questions

Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Jun 13, 2024

Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, Dan Klein

Figure 1 for Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Figure 2 for Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Figure 3 for Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Figure 4 for Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Abstract:We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses to non-"standard" varieties consistently exhibit a range of issues: lack of comprehension (10% worse compared to "standard" varieties), stereotyping (16% worse), demeaning content (22% worse), and condescending responses (12% worse). We also find that if these models are asked to imitate the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but it also results in a marked increase in stereotyping (+17%). The results suggest that GPT-3.5 Turbo and GPT-4 exhibit linguistic discrimination in ways that can exacerbate harms for speakers of non-"standard" varieties.

Via

Access Paper or Ask Questions

The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels

May 09, 2024

Eve Fleisig, Su Lin Blodgett, Dan Klein, Zeerak Talat

Abstract:Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement--some challenged by perspectivist approaches, and some that remain to be addressed--as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement.

Via

Access Paper or Ask Questions

Mapping Social Choice Theory to RLHF

Apr 19, 2024

Jessica Dai, Eve Fleisig

Abstract:Recent work on the limitations of using reinforcement learning from human feedback (RLHF) to incorporate human preferences into model behavior often raises social choice theory as a reference point. Social choice theory's analysis of settings such as voting mechanisms provides technical infrastructure that can inform how to aggregate human preferences amid disagreement. We analyze the problem settings of social choice and RLHF, identify key differences between them, and discuss how these differences may affect the RLHF interpretation of well-known technical results in social choice.

Via

Access Paper or Ask Questions

Incorporating Worker Perspectives into MTurk Annotation Practices for NLP

Nov 16, 2023

Olivia Huang, Eve Fleisig, Dan Klein

Abstract:Current practices regarding data collection for natural language processing on Amazon Mechanical Turk (MTurk) often rely on a combination of studies on data quality and heuristics shared among NLP researchers. However, without considering the perspectives of MTurk workers, these approaches are susceptible to issues regarding workers' rights and poor response quality. We conducted a critical literature review and a survey of MTurk workers aimed at addressing open questions regarding best practices for fair payment, worker privacy, data quality, and considering worker incentives. We found that worker preferences are often at odds with received wisdom among NLP researchers. Surveyed workers preferred reliable, reasonable payments over uncertain, very high payments; reported frequently lying on demographic questions; and expressed frustration at having work rejected with no explanation. We also found that workers view some quality control methods, such as requiring minimum response times or Master's qualifications, as biased and largely ineffective. Based on the survey results, we provide recommendations on how future NLP studies may better account for MTurk workers' experiences in order to respect workers' rights and improve data quality.

Via

Access Paper or Ask Questions

First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

Nov 08, 2023

Naomi Saphra, Eve Fleisig, Kyunghyun Cho, Adam Lopez

Figure 1 for First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

Figure 2 for First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

Abstract:Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation. We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. Among these lessons, we discuss the primacy of hardware advancement in shaping the availability and importance of scale, as well as the urgent challenge of quality evaluation, both automated and human. We argue that disparities in scale are transient and that researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many meaningful applications; that meaningful evaluation informed by actual use is still an open problem; and that there is still room for speculative approaches.

Via

Access Paper or Ask Questions

When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

May 24, 2023

Eve Fleisig, Rediet Abebe, Dan Klein

Figure 1 for When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

Figure 2 for When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

Figure 3 for When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

Figure 4 for When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

Abstract:Though majority vote among annotators is typically used for ground truth labels in natural language processing, annotator disagreement in tasks such as hate speech detection may reflect differences in opinion across groups, not noise. Thus, a crucial problem in hate speech detection is determining whether a statement is offensive to the demographic group that it targets, when that group may constitute a small fraction of the annotator pool. We construct a model that predicts individual annotator ratings on potentially offensive text and combines this information with the predicted target group of the text to model the opinions of target group members. We show gains across a range of metrics, including raising performance over the baseline by 22% at predicting individual annotators' ratings and by 33% at predicting variance among annotators, which provides a metric for model uncertainty downstream. We find that annotator ratings can be predicted using their demographic information and opinions on online content, without the need to track identifying annotator IDs that link each annotator to their ratings. We also find that use of non-invasive survey questions on annotators' online experiences helps to maximize privacy and minimize unnecessary collection of demographic information when predicting annotators' opinions.

Via

Access Paper or Ask Questions