Manish Raghavan

Double Machine Learning for Causal Inference under Shared-State Interference

Apr 10, 2025

Chris Hays, Manish Raghavan

Abstract: Researchers and practitioners often wish to measure treatment effects in settings where units interact via markets and recommendation systems. In these settings, units are affected by certain shared states, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals' potential outcomes are independent conditional on the shared state. We then prove an extension of a double machine learning (DML) theorem providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE).

* 48 pages, 6 figures

Evaluating multiple models using labeled and unlabeled data

Jan 21, 2025

Divya Shanmugam, Shuvom Sadhuka, Manish Raghavan, John Guttag, Bonnie Berger, Emma Pierson

Abstract: It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models.

Competition and Diversity in Generative AI

Dec 11, 2024

Manish Raghavan

Abstract: Recent evidence suggests that the use of generative artificial intelligence reduces the diversity of content produced. In this work, we develop a game-theoretic model to explore the downstream consequences of content homogeneity when producers use generative AI to compete with one another. At equilibrium, players indeed produce content that is less diverse than optimal. However, stronger competition mitigates homogeneity and induces more diverse production. Perhaps more surprisingly, we show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to do so when faced with competition, and vice versa. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for producing answers that are both correct and unique. We discuss how the interplay between competition and homogeneity has implications for the development, evaluation, and use of generative AI.
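
The "correct and unique" reward described in the abstract above can be illustrated with a minimal sketch. The function name `score_round`, the player labels, and the set of valid answers are hypothetical, and the abstract does not state the exact scoring protocol used in the experiments; this only shows why a strong answer can score zero when competitors produce the same one.

```python
from collections import Counter

def score_round(answers, valid):
    """Score one Scattergories-style round (illustrative rule only).

    answers: dict mapping player name -> submitted answer
    valid:   set of lowercase answers counted as correct (an assumption)
    A player earns 1 point only if the answer is valid AND no other
    player submitted the same answer; otherwise 0.
    """
    # Normalize so that "Apple" and "apple" count as the same answer.
    normalized = {player: ans.strip().lower() for player, ans in answers.items()}
    counts = Counter(normalized.values())
    return {
        player: int(ans in valid and counts[ans] == 1)
        for player, ans in normalized.items()
    }

# Hypothetical round: two players collide on the most obvious answer.
answers = {"A": "Apple", "B": "apple", "C": "avocado", "D": "brick"}
valid = {"apple", "avocado"}
scores = score_round(answers, valid)
print(scores)
```

Here players A and B both give a correct answer but cancel each other out, while C scores with a less obvious one. This is the sense in which performing well in isolation need not carry over to a competitive setting.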

Integrating Expert Judgment and Algorithmic Decision Making: An Indistinguishability Framework

Oct 11, 2024

Rohan Alur, Loren Laine, Darrick K. Li, Dennis Shung, Manish Raghavan, Devavrat Shah

Abstract: We introduce a novel framework for human-AI collaboration in prediction and decision tasks. Our approach leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or "look the same" to any feasible predictive algorithm. We argue that this framing clarifies the problem of human-AI collaboration in prediction and decision tasks, as experts often form judgments by drawing on information which is not encoded in an algorithm's training data. Algorithmic indistinguishability yields a natural test for assessing whether experts incorporate this kind of "side information", and further provides a simple but principled method for selectively incorporating human feedback into algorithmic predictions. We show that this method provably improves the performance of any feasible algorithmic predictor and precisely quantify this improvement. We demonstrate the utility of our framework in a case study of emergency room triage decisions, where we find that although algorithmic risk scores are highly competitive with physicians, there is strong evidence that physician judgments provide signal which could not be replicated by any predictive algorithm. This insight yields a range of natural decision rules which leverage the complementary strengths of human experts and predictive algorithms.

* arXiv admin note: substantial text overlap with arXiv:2402.00793

Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models

Oct 10, 2024

Vinith M. Suriyakumar, Rohan Alur, Ayush Sekhari, Manish Raghavan, Ashia C. Wilson

Abstract: Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with "unlearning" steps (to "forget" existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to "relearn" concepts that were previously "unlearned." We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments which compose "mass concept erasure" (the current state of the art for unlearning in text-to-image diffusion models (Lu et al., 2024)) with subsequent fine-tuning of Stable Diffusion v1.4. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.

* 20 pages, 13 figures

Distinguishing the Indistinguishable: Human Expertise in Algorithmic Prediction

Feb 01, 2024

Rohan Alur, Manish Raghavan, Devavrat Shah

Abstract: We introduce a novel framework for incorporating human expertise into algorithmic predictions. Our approach focuses on the use of human judgment to distinguish inputs which "look the same" to any feasible predictive algorithm. We argue that this framing clarifies the problem of human/AI collaboration in prediction tasks, as experts often have access to information -- particularly subjective information -- which is not encoded in the algorithm's training data. We use this insight to develop a set of principled algorithms for selectively incorporating human feedback only when it improves the performance of any feasible predictor. We find empirically that although algorithms often outperform their human counterparts on average, human judgment can significantly improve algorithmic predictions on specific instances (which can be identified ex-ante). In an X-ray classification task, we find that this subset constitutes nearly 30% of the patient population. Our approach provides a natural way of uncovering this heterogeneity and thus enabling effective human-AI collaboration.

Reconciling the accuracy-diversity trade-off in recommendations

Jul 27, 2023

Kenny Peng, Manish Raghavan, Emma Pierson, Jon Kleinberg, Nikhil Garg

Abstract: In recommendation settings, there is an apparent trade-off between the goals of accuracy (to recommend items a user is most likely to want) and diversity (to recommend items representing a range of categories). As such, real-world recommender systems often explicitly incorporate diversity separately from accuracy. This approach, however, leaves a basic question unanswered: Why is there a trade-off in the first place? We show how the trade-off can be explained via a user's consumption constraints -- users typically only consume a few of the items they are recommended. In a stylized model we introduce, objectives that account for this constraint induce diverse recommendations, while objectives that do not account for this constraint induce homogeneous recommendations. This suggests that accuracy and diversity appear misaligned because standard accuracy metrics do not consider consumption constraints. Our model yields precise and interpretable characterizations of diversity in different settings, giving practical insights into the design of diverse recommendations.

* 34 pages, 5 figures
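
The role of the consumption constraint can be seen in a toy example. This is an illustrative construction, not the paper's stylized model: it assumes the user wants exactly one category at a time (with the hypothetical probabilities in `p`), likes an item only if it comes from that category, and is shown a slate of `k` items.

```python
from itertools import combinations_with_replacement

# Hypothetical probabilities that the user currently wants each category.
p = {"drama": 0.5, "comedy": 0.3, "documentary": 0.2}
k = 2  # slate size

def expected_liked(slate):
    # Objective ignoring the consumption constraint: every liked item counts,
    # so the expected number of liked items is the sum of p over the slate.
    return sum(p[category] for category in slate)

def prob_any_liked(slate):
    # Objective with the constraint: the user consumes at most one item, so
    # only whether SOME item matches matters. Duplicated categories add nothing.
    return sum(p[category] for category in set(slate))

# Enumerate every slate of k categories (order does not matter).
slates = list(combinations_with_replacement(list(p), k))

best_unconstrained = max(slates, key=expected_liked)
best_constrained = max(slates, key=prob_any_liked)

print("ignoring consumption constraint:", best_unconstrained)
print("with consumption constraint:   ", best_constrained)
```

Under these assumptions the unconstrained objective fills the slate with the single most likely category, while the constrained objective spreads across the two most likely categories, mirroring the homogeneous versus diverse behaviour described in the abstract.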

Auditing for Human Expertise

Jun 02, 2023

Rohan Alur, Loren Laine, Darrick K. Li, Manish Raghavan, Devavrat Shah, Dennis Shung

Abstract: High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained human experts. A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises the natural question of whether human experts add value which could not be captured by an algorithmic predictor. We develop a statistical framework under which we can pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. Instead, we propose a simple procedure which tests whether expert predictions are statistically independent from the outcomes of interest after conditioning on the available inputs ("features"). A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI "complementarity" is achievable in a given prediction task. We highlight the utility of our procedure using admissions data collected from the emergency department of a large academic hospital system, where we show that physicians' admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to be incorporating information not captured in a standard algorithmic screening tool. This is despite the fact that the screening tool is arguably more accurate than physicians' discretionary decisions, highlighting that -- even absent normative concerns about accountability or interpretability -- accuracy is insufficient to justify algorithmic automation.

* 27 pages, 8 figures
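
The idea of testing whether expert predictions are independent of outcomes given the features can be sketched as a within-stratum permutation test. This is a simplified illustration and not the paper's procedure: it assumes discrete features so that observations with identical inputs can be grouped exactly, and the function name, loss, and parameters are all hypothetical.

```python
import random
from collections import defaultdict

def conditional_permutation_test(x, y, expert, n_perm=1000, seed=0):
    """Permutation p-value for 'expert is independent of y given x'.

    x:      hashable feature values (assumed discrete, so strata are exact)
    y:      numeric outcomes
    expert: numeric expert predictions
    Under the null, shuffling expert predictions among observations that
    share the same x should not systematically change the expert's loss.
    """
    rng = random.Random(seed)

    # Group observation indices by identical feature value.
    strata = defaultdict(list)
    for i, xi in enumerate(x):
        strata[xi].append(i)

    def loss(pred):
        # Mean absolute error of predictions against the outcomes.
        return sum(abs(p - t) for p, t in zip(pred, y)) / len(y)

    observed = loss(expert)
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = list(expert)
        for idx in strata.values():
            vals = [expert[i] for i in idx]
            rng.shuffle(vals)  # permute only within the stratum
            for i, v in zip(idx, vals):
                shuffled[i] = v
        if loss(shuffled) <= observed:
            at_least_as_good += 1
    # Add-one correction keeps the p-value strictly positive.
    return (1 + at_least_as_good) / (1 + n_perm)

# Synthetic check: two feature values, outcomes vary within each stratum.
x = [0] * 20 + [1] * 20
y = [i % 2 for i in range(40)]
informed_expert = list(y)              # tracks y beyond what x reveals
feature_only_expert = [xi for xi in x] # depends on x alone

p_informed = conditional_permutation_test(x, y, informed_expert)
p_feature_only = conditional_permutation_test(x, y, feature_only_expert)
print(p_informed, p_feature_only)
```

A small p-value for the informed expert suggests the predictions carry signal not contained in the features, whereas the feature-only expert is unaffected by within-stratum shuffling and yields a p-value of 1.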

Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

Jan 17, 2023

Chris Hays, Zachary Schutzman, Manish Raghavan, Erin Walk, Philipp Zimmer

Abstract: Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near perfect performance for classification on existing datasets, suggesting bot detection is accurate, reliable and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than sophistication of the tools. Specifically, we show that simple decision rules -- shallow decision trees trained on a small number of features -- achieve near-state-of-the-art performance on most available datasets and that bot detection datasets, even when combined together, do not generalize well to out-of-sample datasets. Our findings reveal that predictions are highly dependent on each dataset's collection and labeling procedures rather than fundamental differences between bots and humans. These results have important implications for both transparency in sampling and labeling procedures and potential biases in research using existing bot detection tools for pre-processing.

* 10 pages, 6 figures

Fairness On The Ground: Applying Algorithmic Fairness Approaches to Production Systems

Mar 24, 2021

Chloé Bakalar, Renata Barreto, Stevie Bergman, Miranda Bogen, Bobbie Chern, Sam Corbett-Davies, Melissa Hall, Isabel Kloumann, Michelle Lam, Joaquin Quiñonero Candela (+6 more)

Abstract: Many technical approaches have been proposed for ensuring that decisions made by machine learning systems are fair, but few of these proposals have been stress-tested in real-world systems. This paper presents an example of one team's approach to the challenge of applying algorithmic fairness approaches to complex production systems within the context of a large technology company. We discuss how we disentangle normative questions of product and policy design (like, "how should the system trade off between different stakeholders' interests and needs?") from empirical questions of system implementation (like, "is the system achieving the desired tradeoff in practice?"). We also present an approach for answering questions of the latter sort, which allows us to measure how machine learning systems and human labelers are making these tradeoffs across different relevant groups. We hope our experience integrating fairness tools and approaches into large-scale and complex production systems will be useful to other practitioners facing similar challenges, and illuminating to academics and researchers looking to better address the needs of practitioners.

* 12 pages, 2 figures
