Abstract:Introduction: Ensuring timely and accurate diagnosis of medical conditions is paramount for effective patient care. Electrocardiogram (ECG) signals are fundamental for evaluating a patient's cardiac health and are readily available. Despite this, little attention has been given to the remarkable potential of ECG data in detecting non-cardiac conditions. Methods: In our study, we used publicly available datasets (MIMIC-IV-ECG-ICD and ECG-VIEW II) to investigate the feasibility of inferring general diagnostic conditions from ECG features. To this end, we trained a tree-based model (XGBoost) based on ECG features and basic demographic features to estimate a wide range of diagnoses, encompassing both cardiac and non-cardiac conditions. Results: Our results demonstrate the reliability of estimating 23 cardiac as well as 21 non-cardiac conditions above 0.7 AUROC in a statistically significant manner across a wide range of physiological categories. Our findings underscore the predictive potential of ECG data in identifying well-known cardiac conditions. However, even more striking, this research represents a pioneering effort in systematically expanding the scope of ECG-based diagnosis to conditions not traditionally associated with the cardiac system.
Abstract:Background: Benchmarking medical decision support algorithms often struggles due to limited access to datasets, narrow prediction tasks, and restricted input modalities. These limitations affect their clinical relevance and performance in high-stakes areas like emergency care, complicating replication, validation, and improvement of benchmarks. Methods: We introduce a dataset based on MIMIC-IV, benchmarking protocol, and initial results for evaluating multimodal decision support in the emergency department (ED). We use diverse data modalities from the first 1.5 hours of patient arrival, including demographics, biometrics, vital signs, lab values, and electrocardiogram waveforms. We analyze 1443 clinical labels across two contexts: predicting diagnoses with ICD-10 codes and forecasting patient deterioration. Results: Our multimodal diagnostic model achieves an AUROC score over 0.8 in a statistically significant manner for 357 out of 1428 conditions, including cardiac issues like myocardial infarction and non-cardiac conditions such as renal disease and diabetes. The deterioration model scores above 0.8 in a statistically significant manner for 13 out of 15 targets, including critical events like cardiac arrest and mechanical ventilation, ICU admission as well as short- and long-term mortality. Incorporating raw waveform data significantly improves model performance, which represents one of the first robust demonstrations of this effect. Conclusions: This study highlights the uniqueness of our dataset, which encompasses a wide range of clinical tasks and utilizes a comprehensive set of features collected early during the emergency after arriving at the ED. The strong performance, as evidenced by high AUROC scores across diagnostic and deterioration targets, underscores the potential of our approach to revolutionize decision-making in acute and emergency medicine.
Abstract:Introduction: Laboratory value represents a cornerstone of medical diagnostics, but suffers from slow turnaround times, and high costs and only provides information about a single point in time. The continuous estimation of laboratory values from non-invasive data such as electrocardiogram (ECG) would therefore mark a significant frontier in healthcare monitoring. Despite its transformative potential, this domain remains relatively underexplored within the medical community. Methods: In this preliminary study, we used a publicly available dataset (MIMIC-IV-ECG) to investigate the feasibility of inferring laboratory values from ECG features and patient demographics using tree-based models (XGBoost). We define the prediction task as a binary prediction problem of predicting whether the lab value falls into low or high abnormalities. The model performance can then be assessed using AUROC. Results: Our findings demonstrate promising results in the estimation of laboratory values related to different organ systems based on a small yet comprehensive set of features. While further research and validation are warranted to fully assess the clinical utility and generalizability of ECG-based estimation in healthcare monitoring, our findings lay the groundwork for future investigations into approaches to laboratory value estimation using ECG data. Such advancements hold promise for revolutionizing predictive healthcare applications, offering faster, non-invasive, and more affordable means of patient monitoring.
Abstract:Despite the excelling performance of machine learning models, understanding the decisions of machine learning models remains a long-standing goal. While commonly used attribution methods in explainable AI attempt to address this issue, they typically rely on associational rather than causal relationships. In this study, within the context of time series classification, we introduce a novel framework to assess the causal effect of concepts, i.e., predefined segments within a time series, on specific classification outcomes. To achieve this, we leverage state-of-the-art diffusion-based generative models to estimate counterfactual outcomes. Our approach compares these causal attributions with closely related associational attributions, both theoretically and empirically. We demonstrate the insights gained by our approach for a diverse set of qualitatively different time series classification tasks. Although causal and associational attributions might often share some similarities, in all cases they differ in important details, underscoring the risks associated with drawing causal conclusions from associational data alone. We believe that the proposed approach is widely applicable also in other domains, particularly where predefined segmentations are available, to shed some light on the limits of associational attributions.
Abstract:This study aims to elucidate the significance of long-range correlations for deep-learning-based sleep staging. It is centered around S4Sleep(TS), a recently proposed model for automated sleep staging. This model utilizes electroencephalography (EEG) as raw time series input and relies on structured state space sequence (S4) models as essential model component. Although the model already surpasses state-of-the-art methods for a moderate number of 15 input epochs, recent literature results suggest potential benefits from incorporating very long correlations spanning hundreds of input epochs. In this submission, we explore the possibility of achieving further enhancements by systematically scaling up the model's input size, anticipating potential improvements in prediction accuracy. In contrast to findings in literature, our results demonstrate that augmenting the input size does not yield a significant enhancement in the performance of S4Sleep(TS). These findings, coupled with the distinctive ability of S4 models to capture long-range dependencies in time series data, cast doubt on the diagnostic relevance of very long-range interactions for sleep staging.
Abstract:Feature removal is a central building block for eXplainable AI (XAI), both for occlusion-based explanations (Shapley values) as well as their evaluation (pixel flipping, PF). However, occlusion strategies can vary significantly from simple mean replacement up to inpainting with state-of-the-art diffusion models. This ambiguity limits the usefulness of occlusion-based approaches. For example, PF benchmarks lead to contradicting rankings. This is amplified by competing PF measures: Features are either removed starting with most influential first (MIF) or least influential first (LIF). This study proposes two complementary perspectives to resolve this disagreement problem. Firstly, we address the common criticism of occlusion-based XAI, that artificial samples lead to unreliable model evaluations. We propose to measure the reliability by the R(eference)-Out-of-Model-Scope (OMS) score. The R-OMS score enables a systematic comparison of occlusion strategies and resolves the disagreement problem by grouping consistent PF rankings. Secondly, we show that the insightfulness of MIF and LIF is conversely dependent on the R-OMS score. To leverage this, we combine the MIF and LIF measures into the symmetric relevance gain (SRG) measure. This breaks the inherent connection to the underlying occlusion strategy and leads to consistent rankings. This resolves the disagreement problem, which we verify for a set of 40 different occlusion strategies.
Abstract:Current deep learning algorithms designed for automatic ECG analysis have exhibited notable accuracy. However, akin to traditional electrocardiography, they tend to be narrowly focused and typically address a singular diagnostic condition. In this study, we specifically demonstrate the capability of a single model to predict a diverse range of both cardiac and non-cardiac discharge diagnoses based on a sole ECG collected in the emergency department. Among the 1,076 hierarchically structured ICD codes considered, our model achieves an AUROC exceeding 0.8 in 439 of them. This underscores the models proficiency in handling a wide array of diagnostic scenarios. We emphasize the potential of utilizing this model as a screening tool, potentially integrated into a holistic clinical decision support system for efficiently triaging patients in the emergency department. This research underscores the remarkable capabilities of comprehensive ECG analysis algorithms and the extensive range of possibilities facilitated by the open MIMIC-IV-ECG dataset. Finally, our data may play a pivotal role in revolutionizing the way ECG analysis is performed, marking a significant advancement in the field.
Abstract:Cardiovascular diseases remain the leading global cause of mortality. This necessitates a profound understanding of heart aging processes to diagnose constraints in cardiovascular fitness. Traditionally, most of such insights have been drawn from the analysis of electrocardiogram (ECG) feature changes of individuals as they age. However, these features, while informative, may potentially obscure underlying data relationships. In this paper, we employ a deep-learning model and a tree-based model to analyze ECG data from a robust dataset of healthy individuals across varying ages in both raw signals and ECG feature format. Explainable AI techniques are then used to identify ECG features or raw signal characteristics are most discriminative for distinguishing between age groups. Our analysis with tree-based classifiers reveal age-related declines in inferred breathing rates and identifies notably high SDANN values as indicative of elderly individuals, distinguishing them from younger adults. Furthermore, the deep-learning model underscores the pivotal role of the P-wave in age predictions across all age groups, suggesting potential changes in the distribution of different P-wave types with age. These findings shed new light on age-related ECG changes, offering insights that transcend traditional feature-based approaches.
Abstract:Scoring sleep stages in polysomnography recordings is a time-consuming task plagued by significant inter-rater variability. Therefore, it stands to benefit from the application of machine learning algorithms. While many algorithms have been proposed for this purpose, certain critical architectural decisions have not received systematic exploration. In this study, we meticulously investigate these design choices within the broad category of encoder-predictor architectures. We identify robust architectures applicable to both time series and spectrogram input representations. These architectures incorporate structured state space models as integral components, leading to statistically significant advancements in performance on the extensive SHHS dataset. These improvements are assessed through both statistical and systematic error estimations. We anticipate that the architectural insights gained from this study will not only prove valuable for future research in sleep staging but also hold relevance for other time series annotation tasks.
Abstract:Motivation: We explored how explainable AI (XAI) can help to shed light into the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too. Results: The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g., transmembrane regions, active sites) across many proteins. Availability and Implementation: Source code can be accessed at https://github.com/markuswenzel/xai-proteins .