Intelligent Critical Care Center, University of Florida, Gainesville, FL, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL
Abstract:Given the sheer volume of surgical procedures and the significant rate of postoperative fatalities, assessing and managing surgical complications has become a critical public health concern. Existing artificial intelligence (AI) tools for risk surveillance and diagnosis often lack adequate interpretability, fairness, and reproducibility. To address this, we proposed an Explainable AI (XAI) framework designed to answer five critical questions: why, why not, how, what if, and what else, with the goal of enhancing the explainability and transparency of AI models. We incorporated various techniques such as Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), counterfactual explanations, model cards, an interactive feature manipulation interface, and the identification of similar patients to address these questions. We showcased an XAI interface prototype that adheres to this framework for predicting major postoperative complications. This initial implementation has provided valuable insights into the vast explanatory potential of our XAI framework and represents an initial step towards its clinical adoption.
Abstract:Modern electronic health records (EHRs) hold immense promise in tracking personalized patient health trajectories through sequential deep learning, owing to their extensive breadth, scale, and temporal granularity. Nonetheless, how to effectively leverage multiple modalities from EHRs poses significant challenges, given its complex characteristics such as high dimensionality, multimodality, sparsity, varied recording frequencies, and temporal irregularities. To this end, this paper introduces a novel multimodal contrastive learning framework, specifically focusing on medical time series and clinical notes. To tackle the challenge of sparsity and irregular time intervals in medical time series, the framework integrates temporal cross-attention transformers with a dynamic embedding and tokenization scheme for learning multimodal feature representations. To harness the interconnected relationships between medical time series and clinical notes, the framework equips a global contrastive loss, aligning a patient's multimodal feature representations with the corresponding discharge summaries. Since discharge summaries uniquely pertain to individual patients and represent a holistic view of the patient's hospital stay, machine learning models are led to learn discriminative multimodal features via global contrasting. Extensive experiments with a real-world EHR dataset demonstrated that our framework outperformed state-of-the-art approaches on the exemplar task of predicting the occurrence of nine postoperative complications for more than 120,000 major inpatient surgeries using multimodal data from UF health system split among three hospitals (UF Health Gainesville, UF Health Jacksonville, and UF Health Jacksonville-North).
Abstract:Background: The accurate prediction of postoperative complication risk using Electronic Health Records (EHR) and artificial intelligence shows great potential. Training a robust artificial intelligence model typically requires large-scale and diverse datasets. In reality, collecting medical data often encounters challenges surrounding privacy protection. Methods: This retrospective cohort study includes adult patients who were admitted to UFH Gainesville (GNV) (n = 79,850) and Jacksonville (JAX) (n = 28,636) for any type of inpatient surgical procedure. Using perioperative and intraoperative features, we developed federated learning models to predict nine major postoperative complications (i.e., prolonged intensive care unit stay and mechanical ventilation). We compared federated learning models with local learning models trained on a single site and central learning models trained on pooled dataset from two centers. Results: Our federated learning models achieved the area under the receiver operating characteristics curve (AUROC) values ranged from 0.81 for wound complications to 0.92 for prolonged ICU stay at UFH GNV center. At UFH JAX center, these values ranged from 0.73-0.74 for wound complications to 0.92-0.93 for hospital mortality. Federated learning models achieved comparable AUROC performance to central learning models, except for prolonged ICU stay, where the performance of federated learning models was slightly higher than central learning models at UFH GNV center, but slightly lower at UFH JAX center. In addition, our federated learning model obtained comparable performance to the best local learning model at each center, demonstrating strong generalizability. Conclusion: Federated learning is shown to be a useful tool to train robust and generalizable models from large scale data across multiple institutions where data protection barriers are high.
Abstract:The breadth, scale, and temporal granularity of modern electronic health records (EHR) systems offers great potential for estimating personalized and contextual patient health trajectories using sequential deep learning. However, learning useful representations of EHR data is challenging due to its high dimensionality, sparsity, multimodality, irregular and variable-specific recording frequency, and timestamp duplication when multiple measurements are recorded simultaneously. Although recent efforts to fuse structured EHR and unstructured clinical notes suggest the potential for more accurate prediction of clinical outcomes, less focus has been placed on EHR embedding approaches that directly address temporal EHR challenges by learning time-aware representations from multimodal patient time series. In this paper, we introduce a dynamic embedding and tokenization framework for precise representation of multimodal clinical time series that combines novel methods for encoding time and sequential position with temporal cross-attention. Our embedding and tokenization framework, when integrated into a multitask transformer classifier with sliding window attention, outperformed baseline approaches on the exemplar task of predicting the occurrence of nine postoperative complications of more than 120,000 major inpatient surgeries using multimodal data from three hospitals and two academic health centers in the United States.
Abstract:Background: Acute kidney injury (AKI), the decline of kidney excretory function, occurs in up to 18% of hospitalized admissions. Progression of AKI may lead to irreversible kidney damage. Methods: This retrospective cohort study includes adult patients admitted to a non-intensive care unit at the University of Pittsburgh Medical Center (UPMC) (n = 46,815) and University of Florida Health (UFH) (n = 127,202). We developed and compared deep learning and conventional machine learning models to predict progression to Stage 2 or higher AKI within the next 48 hours. We trained local models for each site (UFH Model trained on UFH, UPMC Model trained on UPMC) and a separate model with a development cohort of patients from both sites (UFH-UPMC Model). We internally and externally validated the models on each site and performed subgroup analyses across sex and race. Results: Stage 2 or higher AKI occurred in 3% (n=3,257) and 8% (n=2,296) of UFH and UPMC patients, respectively. Area under the receiver operating curve values (AUROC) for the UFH test cohort ranged between 0.77 (UPMC Model) and 0.81 (UFH Model), while AUROC values ranged between 0.79 (UFH Model) and 0.83 (UPMC Model) for the UPMC test cohort. UFH-UPMC Model achieved an AUROC of 0.81 (95% confidence interval [CI] [0.80, 0.83]) for UFH and 0.82 (95% CI [0.81,0.84]) for UPMC test cohorts; an area under the precision recall curve values (AUPRC) of 0.6 (95% CI, [0.05, 0.06]) for UFH and 0.13 (95% CI, [0.11,0.15]) for UPMC test cohorts. Kinetic estimated glomerular filtration rate, nephrotoxic drug burden and blood urea nitrogen remained the top three features with the highest influence across the models and health centers. Conclusion: Locally developed models displayed marginally reduced discrimination when tested on another institution, while the top set of influencing features remained the same across the models and sites.
Abstract:The field of healthcare has increasingly turned its focus towards Large Language Models (LLMs) due to their remarkable performance. However, their performance in actual clinical applications has been underexplored. Traditional evaluations based on question-answering tasks don't fully capture the nuanced contexts. This gap highlights the need for more in-depth and practical assessments of LLMs in real-world healthcare settings. Objective: We sought to evaluate the performance of LLMs in the complex clinical context of adult critical care medicine using systematic and comprehensible analytic methods, including clinician annotation and adjudication. Methods: We investigated the performance of three general LLMs in understanding and processing real-world clinical notes. Concepts from 150 clinical notes were identified by MetaMap and then labeled by 9 clinicians. Each LLM's proficiency was evaluated by identifying the temporality and negation of these concepts using different prompts for an in-depth analysis. Results: GPT-4 showed overall superior performance compared to other LLMs. In contrast, both GPT-3.5 and text-davinci-003 exhibit enhanced performance when the appropriate prompting strategies are employed. The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities. Conclusion: A comprehensive qualitative performance evaluation framework for LLMs is developed and operationalized. This framework goes beyond singular performance aspects. With expert annotations, this methodology not only validates LLMs' capabilities in processing complex medical data but also establishes a benchmark for future LLM evaluations across specialized domains.
Abstract:Advancements in medical imaging and endovascular grafting have facilitated minimally invasive treatments for aortic diseases. Accurate 3D segmentation of the aorta and its branches is crucial for interventions, as inaccurate segmentation can lead to erroneous surgical planning and endograft construction. Previous methods simplified aortic segmentation as a binary image segmentation problem, overlooking the necessity of distinguishing between individual aortic branches. In this paper, we introduce Context Infused Swin-UNet (CIS-UNet), a deep learning model designed for multi-class segmentation of the aorta and thirteen aortic branches. Combining the strengths of Convolutional Neural Networks (CNNs) and Swin transformers, CIS-UNet adopts a hierarchical encoder-decoder structure comprising a CNN encoder, symmetric decoder, skip connections, and a novel Context-aware Shifted Window Self-Attention (CSW-SA) as the bottleneck block. Notably, CSW-SA introduces a unique utilization of the patch merging layer, distinct from conventional Swin transformers. It efficiently condenses the feature map, providing a global spatial context and enhancing performance when applied at the bottleneck layer, offering superior computational efficiency and segmentation accuracy compared to the Swin transformers. We trained our model on computed tomography (CT) scans from 44 patients and tested it on 15 patients. CIS-UNet outperformed the state-of-the-art SwinUNetR segmentation model, which is solely based on Swin transformers, by achieving a superior mean Dice coefficient of 0.713 compared to 0.697, and a mean surface distance of 2.78 mm compared to 3.39 mm. CIS-UNet's superior 3D aortic segmentation offers improved precision and optimization for planning endovascular treatments. Our dataset and code will be publicly available.
Abstract:The acuity state of patients in the intensive care unit (ICU) can quickly change from stable to unstable, sometimes leading to life-threatening conditions. Early detection of deteriorating conditions can result in providing more timely interventions and improved survival rates. Current approaches rely on manual daily assessments. Some data-driven approaches have been developed, that use mortality as a proxy of acuity in the ICU. However, these methods do not integrate acuity states to determine the stability of a patient or the need for life-sustaining therapies. In this study, we propose APRICOT (Acuity Prediction in Intensive Care Unit), a Transformer-based neural network to predict acuity state in real-time in ICU patients. We develop and extensively validate externally, temporally, and prospectively the APRICOT model on three large datasets: University of Florida Health (UFH), eICU Collaborative Research Database (eICU), and Medical Information Mart for Intensive Care (MIMIC)-IV. The performance of APRICOT shows comparable results to state-of-the-art mortality prediction models (external AUROC 0.93-0.93, temporal AUROC 0.96-0.98, and prospective AUROC 0.98) as well as acuity prediction models (external AUROC 0.80-0.81, temporal AUROC 0.77-0.78, and prospective AUROC 0.87). Furthermore, APRICOT can make predictions for the need for life-sustaining therapies, showing comparable results to state-of-the-art ventilation prediction models (external AUROC 0.80-0.81, temporal AUROC 0.87-0.88, and prospective AUROC 0.85), and vasopressor prediction models (external AUROC 0.82-0.83, temporal AUROC 0.73-0.75, prospective AUROC 0.87). This tool allows for real-time acuity monitoring of a patient and can provide helpful information to clinicians to make timely interventions. Furthermore, the model can suggest life-sustaining therapies that the patient might need in the next hours in the ICU.
Abstract:Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult to use for healthcare providers. Furthermore, such scores do not incorporate granular information such as patients' mobility level, which can indicate recovery or deterioration in the ICU. We hypothesized that existing acuity scores could be potentially improved by employing Artificial Intelligence (AI) techniques in conjunction with Electronic Health Records (EHR) and wearable sensor data. In this study, we evaluated the impact of integrating mobility data collected from wrist-worn accelerometers with clinical data obtained from EHR for developing an AI-driven acuity assessment score. Accelerometry data were collected from 86 patients wearing accelerometers on their wrists in an academic hospital setting. The data was analyzed using five deep neural network models: VGG, ResNet, MobileNet, SqueezeNet, and a custom Transformer network. These models outperformed a rule-based clinical score (SOFA= Sequential Organ Failure Assessment) used as a baseline, particularly regarding the precision, sensitivity, and F1 score. The results showed that while a model relying solely on accelerometer data achieved limited performance (AUC 0.50, Precision 0.61, and F1-score 0.68), including demographic information with the accelerometer data led to a notable enhancement in performance (AUC 0.69, Precision 0.75, and F1-score 0.67). This work shows that the combination of mobility and patient information can successfully differentiate between stable and unstable states in critically ill patients.
Abstract:Initial hours of hospital admission impact clinical trajectory, but early clinical decisions often suffer due to data paucity. With clustering analysis for vital signs within six hours of admission, patient phenotypes with distinct pathophysiological signatures and outcomes may support early clinical decisions. We created a single-center, longitudinal EHR dataset for 75,762 adults admitted to a tertiary care center for 6+ hours. We proposed a deep temporal interpolation and clustering network to extract latent representations from sparse, irregularly sampled vital sign data and derived distinct patient phenotypes in a training cohort (n=41,502). Model and hyper-parameters were chosen based on a validation cohort (n=17,415). Test cohort (n=16,845) was used to analyze reproducibility and correlation with biomarkers. The training, validation, and testing cohorts had similar distributions of age (54-55 yrs), sex (55% female), race, comorbidities, and illness severity. Four clusters were identified. Phenotype A (18%) had most comorbid disease with higher rate of prolonged respiratory insufficiency, acute kidney injury, sepsis, and three-year mortality. Phenotypes B (33%) and C (31%) had diffuse patterns of mild organ dysfunction. Phenotype B had favorable short-term outcomes but second-highest three-year mortality. Phenotype C had favorable clinical outcomes. Phenotype D (17%) had early/persistent hypotension, high rate of early surgery, and substantial biomarker rate of inflammation but second-lowest three-year mortality. After comparing phenotypes' SOFA scores, clustering results did not simply repeat other acuity assessments. In a heterogeneous cohort, four phenotypes with distinct categories of disease and outcomes were identified by a deep temporal interpolation and clustering network. This tool may impact triage decisions and clinical decision-support under time constraints.