Emma Pierson

Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports

Jun 10, 2025

Sidhika Balachandar, Shuvom Sadhuka, Bonnie Berger, Emma Pierson, Nikhil Garg

Abstract:Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, such as predicting infrastructure problems. In this setting, government officials wish to know in which neighborhoods incidents like potholes or rodent issues occur. The true state of incidents (e.g., street conditions) for each neighborhood is observed via government inspection ratings. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced reports, which are more densely observed but may be biased due to heterogeneous reporting behavior. First, for such settings, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect, standardize, and make publicly available a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. Finally, we show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or models that use only rating data, especially when rating data is sparse and reports are predictive of ratings. We also quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.
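
The bias from heterogeneous reporting can be seen in a toy calculation. The sketch below is not the paper's GNN; it assumes a simple hypothetical generative story in which a report can only follow a real incident, and each neighborhood has its own reporting propensity. The class `Neighborhood`, the field `propensity`, and all numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Neighborhood:
    name: str
    incident_rate: float  # true latent rate of incidents (unobserved in practice)
    propensity: float     # hypothetical P(report | incident occurred)


def expected_report_rate(n: Neighborhood) -> float:
    """Expected report rate, assuming reports only follow real incidents."""
    return n.incident_rate * n.propensity


# Two neighborhoods with the same true incident rate but different reporting behavior.
a = Neighborhood("A", incident_rate=0.3, propensity=0.8)
b = Neighborhood("B", incident_rate=0.3, propensity=0.4)

# Judged by reports alone, A looks about twice as troubled as B,
# even though the latent state is identical in both.
ratio = expected_report_rate(a) / expected_report_rate(b)
```

Under this assumption, unbiased ratings for even a few neighborhoods would help separate the incident rate from the reporting propensity, which is the intuition behind combining the two data sources.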

Bayesian Modeling of Zero-Shot Classifications for Urban Flood Detection

Mar 18, 2025

Matt Franchi, Nikhil Garg, Wendy Ju, Emma Pierson

Abstract:Street scene datasets, collected from Street View or dashboard cameras, offer a promising means of detecting urban objects and incidents like street flooding. However, a major challenge in using these datasets is their lack of reliable labels: there are myriad types of incidents, many types occur rarely, and ground-truth measures of where incidents occur are lacking. Here, we propose BayFlood, a two-stage approach which circumvents this difficulty. First, we perform zero-shot classification of where incidents occur using a pretrained vision-language model (VLM). Second, we fit a spatial Bayesian model on the VLM classifications. The zero-shot approach avoids the need to annotate large training sets, and the Bayesian model provides frequent desiderata in urban settings - principled measures of uncertainty, smoothing across locations, and incorporation of external data like stormwater accumulation zones. We comprehensively validate this two-stage approach, showing that VLMs provide strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and our inferred flood risk correlates with known external predictors of risk. Having validated our approach, we show it can be used to improve urban flood detection: our analysis reveals 113,738 people who are at high risk of flooding overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. More broadly, our results showcase how Bayesian modeling of zero-shot LM annotations represents a promising paradigm because it avoids the need to collect large labeled datasets and leverages the power of foundation models while providing the expressiveness and uncertainty quantification of Bayesian models.
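
A much simpler stand-in for the second stage can illustrate how a Bayesian model turns noisy zero-shot classifications into estimates with uncertainty. The sketch below assumes an independent Beta-Binomial model per area, with no spatial smoothing and no external covariates, so it is not the BayFlood model itself. The function name `beta_posterior`, the uniform prior, and the counts are hypothetical.

```python
def beta_posterior(positives: int, total: int,
                   prior_a: float = 1.0, prior_b: float = 1.0) -> tuple[float, float]:
    """Return (mean, variance) of Beta(prior_a + positives, prior_b + total - positives)."""
    if not 0 <= positives <= total:
        raise ValueError("positives must lie between 0 and total")
    a = prior_a + positives
    b = prior_b + (total - positives)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


# Hypothetical counts: (images the VLM flagged as flooded, images seen) per area.
vlm_counts = {"area_1": (3, 10), "area_2": (0, 2)}
posteriors = {area: beta_posterior(k, n) for area, (k, n) in vlm_counts.items()}
```

With few images, as in `area_2`, the posterior mean stays close to the prior and the variance stays large, which is the kind of principled uncertainty the abstract refers to.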

* In review

Sparse Autoencoders for Hypothesis Generation

Feb 05, 2025

Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson

Abstract:We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
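
Step (2), selecting features that predict the target, can be sketched as follows. The abstract does not state the selection criterion, so this example assumes a simple one: rank each sparse feature by the absolute difference in mean target between texts where it is active and texts where it is not. The function `rank_features` and the toy data are hypothetical.

```python
def rank_features(activations: list[list[float]], target: list[float], top_k: int) -> list[int]:
    """Return indices of the top_k features by |mean(target | active) - mean(target | inactive)|."""
    n_features = len(activations[0])
    scores = []
    for j in range(n_features):
        on = [y for row, y in zip(activations, target) if row[j] > 0]
        off = [y for row, y in zip(activations, target) if row[j] <= 0]
        if not on or not off:
            continue  # a feature that is always or never active carries no contrast
        diff = sum(on) / len(on) - sum(off) / len(off)
        scores.append((abs(diff), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]


# Toy data: 4 texts, 3 sparse features, and a target such as click counts.
acts = [
    [1.2, 0.0, 0.5],
    [0.9, 0.1, 0.0],
    [0.0, 0.7, 0.4],
    [0.0, 0.3, 0.0],
]
clicks = [10.0, 12.0, 2.0, 4.0]
selected = rank_features(acts, clicks, top_k=2)
```

Each selected index would then be passed to step (3), where an LLM writes a natural language interpretation of the feature.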

* First two authors contributed equally; working paper

Evaluating multiple models using labeled and unlabeled data

Jan 21, 2025

Divya Shanmugam, Shuvom Sadhuka, Manish Raghavan, John Guttag, Bonnie Berger, Emma Pierson

Abstract:It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models.
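
The final step, estimating a metric from an estimated joint distribution, can be sketched for accuracy. The example assumes that a mixture model has already produced, for each example, a posterior probability that the true label is 1; fitting that model is omitted. For labeled examples the posterior is simply 0 or 1. The function `estimated_accuracy` and the numbers are hypothetical.

```python
def estimated_accuracy(posterior_pos: list[float], predictions: list[int]) -> float:
    """Expected accuracy of binary predictions under per-example posteriors P(y = 1)."""
    if len(posterior_pos) != len(predictions) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y_hat in zip(posterior_pos, predictions):
        # Probability that this prediction matches the (possibly unobserved) true label.
        total += p if y_hat == 1 else 1.0 - p
    return total / len(predictions)


# Two unlabeled examples (soft posteriors) followed by two labeled ones (0 or 1).
posteriors = [0.9, 0.2, 1.0, 0.0]
preds = [1, 0, 1, 1]
acc = estimated_accuracy(posteriors, preds)
```

The same pattern extends to other metrics that are functions of scores and labels: replace the per-example term with the metric's contribution and average under the posterior.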

Learning Disease Progression Models That Capture Health Disparities

Dec 20, 2024

Erica Chiang, Divya Shanmugam, Ashley N. Beecy, Gabriel Sayer, Nir Uriel, Deborah Estrin, Nikhil Garg, Emma Pierson

Abstract:Disease progression models are widely used to inform the diagnosis and treatment of many progressive diseases. However, a significant limitation of existing models is that they do not account for health disparities that can bias the observed data. To address this, we develop an interpretable Bayesian disease progression model that captures three key health disparities: certain patient populations may (1) start receiving care only when their disease is more severe, (2) experience faster disease progression even while receiving care, or (3) receive follow-up care less frequently conditional on disease severity. We show theoretically and empirically that failing to account for disparities produces biased estimates of severity (underestimating severity for disadvantaged groups, for example). On a dataset of heart failure patients, we show that our model can identify groups that face each type of health disparity, and that accounting for these disparities meaningfully shifts which patients are considered high-risk.

Generative AI in Medicine

Dec 13, 2024

Divya Shanmugam, Monica Agrawal, Rajiv Movva, Irene Y. Chen, Marzyeh Ghassemi, Emma Pierson

Abstract:The increased capabilities of generative AI have dramatically expanded its possible use cases in medicine. We provide a comprehensive overview of generative AI use cases for clinicians, patients, clinical trial organizers, researchers, and trainees. We then discuss the many challenges -- including maintaining privacy and security, improving transparency and interpretability, upholding equity, and rigorously evaluating models -- which must be overcome to realize this potential, and the open research directions they give rise to.

* To appear in the Annual Review of Biomedical Data Science, August 2025

Shaping AI's Impact on Billions of Lives

Dec 03, 2024

Mariano-Florentino Cuéllar, Jeff Dean, Finale Doshi-Velez, John Hennessy, Andy Konwinski, Sanmi Koyejo, Pelonomi Moiloa, Emma Pierson, David Patterson

Abstract:Artificial Intelligence (AI), like any transformative technology, has the potential to be a double-edged sword, leading either toward significant advancements or detrimental outcomes for society as a whole. As is often the case when it comes to widely-used technologies in market economies (e.g., cars and semiconductor chips), commercial interest tends to be the predominant guiding factor. The AI community is at risk of becoming polarized to either take a laissez-faire attitude toward AI development, or to call for government overregulation. Between these two poles we argue for the community of AI practitioners to consciously and proactively work for the common good. This paper offers a blueprint for a new type of innovation infrastructure including 18 concrete milestones to guide AI research in that direction. Our view is that we are still in the early days of practical AI, and focused efforts by practitioners, policymakers, and other stakeholders can still maximize the upsides of AI and minimize its downsides. We talked to luminaries such as recent Nobelist John Jumper on science, President Barack Obama on governance, former UN Ambassador and former National Security Advisor Susan Rice on security, philanthropist Eric Schmidt on several topics, and science fiction novelist Neal Stephenson on entertainment. This ongoing dialogue and collaborative effort has produced a comprehensive, realistic view of what the actual impact of AI could be, from a diverse assembly of thinkers with deep understanding of this technology and these domains. From these exchanges, five recurring guidelines emerged, which form the cornerstone of a framework for beginning to harness AI in service of the public good. They not only guide our efforts in discovery but also shape our approach to deploying this transformative technology responsibly and ethically.

LLMs generate structurally realistic social networks but overestimate political homophily

Aug 29, 2024

Serina Chang, Alicja Chaszczewicz, Emma Wang, Maya Josifovska, Emma Pierson, Jure Leskovec

Abstract:Generating social networks is essential for many applications, such as epidemic modeling and social simulations. Prior approaches either involve deep learning models, which require many observed networks for training, or stylized models, which are limited in their realism and flexibility. In contrast, LLMs offer the potential for zero-shot and flexible network generation. However, two key questions are: (1) are LLM-generated networks realistic, and (2) what are the risks of bias, given the importance of demographics in forming social ties? To answer these questions, we develop three prompting methods for network generation and compare the generated networks to real social networks. We find that more realistic networks are generated with "local" methods, where the LLM constructs relations for one persona at a time, compared to "global" methods that construct the entire network at once. We also find that the generated networks match real networks on many characteristics, including density, clustering, community structure, and degree. However, we find that LLMs emphasize political homophily over all other types of homophily and overestimate political homophily relative to real-world measures.
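
Two of the quantities mentioned here are easy to compute on a small graph. The sketch below computes the density of an undirected graph and, as one simple proxy for homophily, the fraction of edges that join nodes of the same group. The abstract does not say which homophily measure the authors use, so the same-group edge fraction is an assumption, as are the function names and the toy graph.

```python
def density(num_nodes: int, edges: list[tuple[int, int]]) -> float:
    """Density of a simple undirected graph: 2m / (n * (n - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * len(edges) / (num_nodes * (num_nodes - 1))


def same_group_edge_fraction(edges: list[tuple[int, int]], group: dict[int, str]) -> float:
    """Fraction of edges whose two endpoints share a group label."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if group[u] == group[v])
    return same / len(edges)


# Toy network: a path on 4 nodes with hypothetical party labels.
toy_edges = [(0, 1), (1, 2), (2, 3)]
party = {0: "D", 1: "D", 2: "R", 3: "R"}
d = density(4, toy_edges)
h = same_group_edge_fraction(toy_edges, party)
```

Comparing such statistics between generated and real networks, per demographic attribute, is one way to check whether a generator overstates a particular kind of homophily.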

Annotation alignment: Comparing LLM and human annotations of conversational safety

Jun 10, 2024

Rajiv Movva, Pang Wei Koh, Emma Pierson

Abstract:To what extent do LLMs align with human perceptions of safety? We study this question via *annotation alignment*, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether GPT-4 exhibits disparities in how well it correlates with demographic groups. Also, there is substantial idiosyncratic variation in correlation *within* groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
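
The headline statistic is a Pearson correlation between a model's rating and the average annotator rating per conversation. A minimal sketch of that computation follows, using made-up ratings rather than DICES data; the function names and all numbers are hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(sxx * syy)


# Made-up data: one model rating per conversation, several annotator ratings each.
model_ratings = [1.0, 2.0, 3.0, 4.0]
annotator_ratings = [[1, 1, 2], [2, 2, 2], [3, 4, 2], [4, 4, 4]]
mean_ratings = [sum(r) / len(r) for r in annotator_ratings]
r = pearson(model_ratings, mean_ratings)
```

The median annotator baseline would apply the same function to each individual annotator's ratings against the average and take the median of the resulting correlations.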

* Working draft, short paper. 5 pages, 1 figure

MEDIQ: Question-Asking LLMs for Adaptive and Reliable Clinical Reasoning

Jun 04, 2024

Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei Koh, Yulia Tsvetkov

Abstract:In high-stakes domains like clinical reasoning, AI assistants powered by large language models (LLMs) are yet to be reliable and safe. We identify a key obstacle towards reliability: existing LLMs are trained to answer any question, even with incomplete context in the prompt or insufficient parametric knowledge. We propose to change this paradigm to develop more careful LLMs that ask follow-up questions to gather necessary and sufficient information and respond reliably. We introduce MEDIQ, a framework to simulate realistic clinical interactions, which incorporates a Patient System and an adaptive Expert System. The Patient may provide incomplete information in the beginning; the Expert refrains from making diagnostic decisions when unconfident, and instead elicits missing details from the Patient via follow-up questions. To evaluate MEDIQ, we convert MEDQA and CRAFT-MD -- medical benchmarks for diagnostic question answering -- into an interactive setup. We develop a reliable Patient system and prototype several Expert systems, first showing that directly prompting state-of-the-art LLMs to ask questions degrades the quality of clinical reasoning, indicating that adapting LLMs to interactive information-seeking settings is nontrivial. We then augment the Expert with a novel abstention module to better estimate model confidence and decide whether to ask more questions, thereby improving diagnostic accuracy by 20.3%; however, performance still lags compared to an (unrealistic in practice) upper bound when full information is given upfront. Further analyses reveal that interactive performance can be improved by filtering irrelevant contexts and reformatting conversations. Overall, our paper introduces a novel problem towards LLM reliability, a novel MEDIQ framework, and highlights important future directions to extend the information-seeking abilities of LLM assistants in critical domains.
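
The abstention idea can be sketched as a loop in which the Expert keeps asking follow-up questions while its confidence is below a threshold. This is only an illustration of the control flow, not the MEDIQ implementation: `run_expert`, `ask_patient`, `estimate_confidence`, the threshold, and the toy confidence rule are all hypothetical.

```python
from typing import Callable, Optional


def run_expert(initial_info: list[str],
               ask_patient: Callable[[int], Optional[str]],
               estimate_confidence: Callable[[list[str]], float],
               threshold: float = 0.8,
               max_questions: int = 5) -> tuple[list[str], int]:
    """Ask follow-up questions until confident enough or out of budget.

    Returns the gathered information and the number of questions asked.
    """
    known = list(initial_info)
    asked = 0
    while estimate_confidence(known) < threshold and asked < max_questions:
        answer = ask_patient(asked)  # None means the patient has nothing to add
        asked += 1
        if answer is not None:
            known.append(answer)
    return known, asked


# Toy stand-ins for the Patient and for the confidence estimator.
facts = ["fever", "night sweats", "weight loss"]
toy_patient = lambda i: facts[i] if i < len(facts) else None
toy_confidence = lambda known: min(1.0, 0.3 * len(known))

info, n_asked = run_expert(["cough"], toy_patient, toy_confidence)
```

In this toy run the Expert starts with one fact, asks two questions, and stops once three facts push the toy confidence past the threshold.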

* 29 pages, 12 figures
