Darlene Neal

A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models

Mar 18, 2024

Stephen R. Pfohl, Heather Cole-Lewis, Rory Sayres, Darlene Neal, Mercy Asiedu, Awa Dieng, Nenad Tomasev, Qazi Mamunur Rashid, Shekoofeh Azizi, Negar Rostamzadeh (+20 more)

Abstract: Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.

Towards Expert-Level Medical Question Answering with Large Language Models

May 16, 2023

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal (+21 more)

Abstract: Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.

The Equitable AI Research Roundtable (EARR): Towards Community-Based Decision Making in Responsible AI Development

Mar 14, 2023

Jamila Smith-Loud, Andrew Smart, Darlene Neal, Amber Ebinama, Eric Corbett, Paul Nicholas, Qazi Rashid, Anne Peckham, Sarah Murphy-Gray, Nicole Morris (+7 more)

Abstract: This paper reports on our initial evaluation of The Equitable AI Research Roundtable -- a coalition of experts in law, education, community engagement, social justice, and technology. EARR was created in collaboration among a large tech firm, nonprofits, NGO research institutions, and universities to provide critical research based perspectives and feedback on technology's emergent ethical and social harms. Through semi-structured workshops and discussions within the large tech firm, EARR has provided critical perspectives and feedback on how to conceptualize equity and vulnerability as they relate to AI technology. We outline three principles in practice of how EARR has operated thus far that are especially relevant to the concerns of the FAccT community: how EARR expands the scope of expertise in AI development, how it fosters opportunities for epistemic curiosity and responsibility, and that it creates a space for mutual learning. This paper serves as both an analysis and translation of lessons learned through this engagement approach, and the possibilities for future research.

* 14 pages, 1 figure
