Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tina Hernandez-Boussard

Securing Dual-Use Pathogen Data of Concern

Feb 08, 2026

Doni Bloomfield, Allison Berke, Moritz S. Hanke, Aaron Maiwald, James R. M. Black, Toby Webster, Tina Hernandez-Boussard, Oliver M. Crook, Jassi Pannu

Abstract:Training data is an essential input into creating competent artificial intelligence (AI) models. AI models for biology are trained on large volumes of data, including data related to biological sequences, structures, images, and functions. The type of data used to train a model is intimately tied to the capabilities it ultimately possesses--including those of biosecurity concern. For this reason, an international group of more than 100 researchers at the recent 50th anniversary Asilomar Conference endorsed data controls to prevent the use of AI for harmful applications such as bioweapons development. To help design such controls, we introduce a five-tier Biosecurity Data Level (BDL) framework for categorizing pathogen data. Each level contains specific data types, based on their expected ability to contribute to capabilities of concern when used to train AI models. For each BDL tier, we propose technical restrictions appropriate to its level of risk. Finally, we outline a novel governance framework for newly created dual-use pathogen data. In a world with widely accessible computational and coding resources, data controls may be among the most high-leverage interventions available to reduce the proliferation of concerning biological AI capabilities.

* 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Biosecurity Safeguards for Generative AI

Via

Access Paper or Ask Questions

Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

Jan 06, 2026

Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer

Abstract:Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.

* 29 pages, 3 figures

Via

Access Paper or Ask Questions

Challenges and recommendations for Electronic Health Records data extraction and preparation for dynamic prediction modelling in hospitalized patients -- a practical guide

Jan 17, 2025

Elena Albu, Shan Gao, Pieter Stijnen, Frank E. Rademakers, Bas C T van Bussel, Taya Collyer, Tina Hernandez-Boussard, Laure Wynants, Ben Van Calster

Abstract:Dynamic predictive modeling using electronic health record (EHR) data has gained significant attention in recent years. The reliability and trustworthiness of such models depend heavily on the quality of the underlying data, which is largely determined by the stages preceding the model development: data extraction from EHR systems and data preparation. We list over forty challenges encountered during these stages and provide actionable recommendations for addressing them. These challenges are organized into four categories: cohort definition, outcome definition, feature engineering, and data cleaning. This list is designed to serve as a practical guide for data extraction engineers and researchers, supporting better practices and improving the quality and real-world applicability of dynamic prediction models in clinical settings.

Via

Access Paper or Ask Questions

Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Jun 17, 2024

Yuqing Wang, Yun Zhao, Sara Alessandra Keller, Anne de Hond, Marieke M. van Buchem, Malvika Pillai, Tina Hernandez-Boussard

Figure 1 for Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Figure 2 for Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Figure 3 for Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Figure 4 for Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Abstract:The advancement of large language models (LLMs) has demonstrated strong capabilities across various applications, including mental health analysis. However, existing studies have focused on predictive performance, leaving the critical issue of fairness underexplored, posing significant risks to vulnerable populations. Despite acknowledging potential biases, previous works have lacked thorough investigations into these biases and their impacts. To address this gap, we systematically evaluate biases across seven social factors (e.g., gender, age, religion) using ten LLMs with different prompting methods on eight diverse mental health datasets. Our results show that GPT-4 achieves the best overall balance in performance and fairness among LLMs, although it still lags behind domain-specific models like MentalRoBERTa in some cases. Additionally, our tailored fairness-aware prompts can effectively mitigate bias in mental health predictions, highlighting the great potential for fair analysis in this field.

* In submission; Data and code are available at: https://github.com/EternityYW/BiasEval-LLM-MentalHealth

Via

Access Paper or Ask Questions

FairEHR-CLP: Towards Fairness-Aware Clinical Predictions with Contrastive Learning in Multimodal Electronic Health Records

Feb 01, 2024

Yuqing Wang, Malvika Pillai, Yun Zhao, Catherine Curtin, Tina Hernandez-Boussard

Abstract:In the high-stakes realm of healthcare, ensuring fairness in predictive models is crucial. Electronic Health Records (EHRs) have become integral to medical decision-making, yet existing methods for enhancing model fairness restrict themselves to unimodal data and fail to address the multifaceted social biases intertwined with demographic factors in EHRs. To mitigate these biases, we present FairEHR-CLP: a general framework for Fairness-aware Clinical Predictions with Contrastive Learning in EHRs. FairEHR-CLP operates through a two-stage process, utilizing patient demographics, longitudinal data, and clinical notes. First, synthetic counterparts are generated for each patient, allowing for diverse demographic identities while preserving essential health information. Second, fairness-aware predictions employ contrastive learning to align patient representations across sensitive attributes, jointly optimized with an MLP classifier with a softmax layer for clinical classification tasks. Acknowledging the unique challenges in EHRs, such as varying group sizes and class imbalance, we introduce a novel fairness metric to effectively measure error rate disparities across subgroups. Extensive experiments on three diverse EHR datasets on three tasks demonstrate the effectiveness of FairEHR-CLP in terms of fairness and utility compared with competitive baselines. FairEHR-CLP represents an advancement towards ensuring both accuracy and equity in predictive healthcare models.

* 16 pages, in submission

Via

Access Paper or Ask Questions

Natural Language Processing Methods to Identify Oncology Patients at High Risk for Acute Care with Clinical Notes

Sep 28, 2022

Claudio Fanconi, Marieke van Buchem, Tina Hernandez-Boussard

Figure 1 for Natural Language Processing Methods to Identify Oncology Patients at High Risk for Acute Care with Clinical Notes

Figure 2 for Natural Language Processing Methods to Identify Oncology Patients at High Risk for Acute Care with Clinical Notes

Figure 3 for Natural Language Processing Methods to Identify Oncology Patients at High Risk for Acute Care with Clinical Notes

Figure 4 for Natural Language Processing Methods to Identify Oncology Patients at High Risk for Acute Care with Clinical Notes

Abstract:Clinical notes are an essential component of a health record. This paper evaluates how natural language processing (NLP) can be used to identify the risk of acute care use (ACU) in oncology patients, once chemotherapy starts. Risk prediction using structured health data (SHD) is now standard, but predictions using free-text formats are complex. This paper explores the use of free-text notes for the prediction of ACU instead of SHD. Deep Learning models were compared to manually engineered language features. Results show that SHD models minimally outperform NLP models; an l1-penalised logistic regression with SHD achieved a C-statistic of 0.748 (95%-CI: 0.735, 0.762), while the same model with language features achieved 0.730 (95%-CI: 0.717, 0.745) and a transformer-based model achieved 0.702 (95%-CI: 0.688, 0.717). This paper shows how language models can be used in clinical applications and underlines how risk bias is different for diverse patient groups, even using only free-text data.

* 11 pages, 6 figures, 2 tables

Via

Access Paper or Ask Questions