Eike Petersen

Robustness and sex differences in skin cancer detection: logistic regression vs CNNs

Apr 15, 2025

Nikolette Pedersen, Regitze Sydendal, Andreas Wulff, Ralf Raumanns, Eike Petersen, Veronika Cheplygina

Abstract: Deep learning has been reported to achieve high performance in the detection of skin cancer, yet many challenges regarding the reproducibility of results and biases remain. This study is a replication (different data, same analysis) of a study on Alzheimer's disease [28] which studied the robustness of logistic regression (LR) and convolutional neural networks (CNNs) across patient sexes. We explore sex bias in skin cancer detection, using the PAD-UFES-20 dataset with LR trained on handcrafted features reflecting dermatological guidelines (ABCDE and the 7-point checklist), and a pre-trained ResNet-50 model. We evaluate these models in alignment with [28]: across multiple training datasets with varied sex composition to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distributions, but the results also revealed that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients than for female patients. We hope these findings contribute to the growing field of investigating potential bias in popular medical machine learning methods. The data and relevant scripts to reproduce our results can be found in our GitHub repository.

* 16 pages (excluding appendix), 2 figures (excluding appendix), submitted to MIUA 2025 conference (response pending)
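
The evaluation protocol described in this abstract (training sets with varied sex composition, then per-sex ACC/AUROC) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the "F"/"M" encoding, and the fixed seed are all assumptions made for the example, and the actual scripts are in the repository the abstract mentions.

```python
import random


def sample_by_sex_ratio(sex, female_fraction, n_train, seed=0):
    """Pick n_train indices so that roughly female_fraction of them are female.

    `sex` is a list of "F"/"M" labels (hypothetical encoding).
    """
    rng = random.Random(seed)
    n_female = round(female_fraction * n_train)
    n_male = n_train - n_female
    female_idx = [i for i, s in enumerate(sex) if s == "F"]
    male_idx = [i for i, s in enumerate(sex) if s == "M"]
    if n_female > len(female_idx) or n_male > len(male_idx):
        raise ValueError("not enough samples of one sex for this ratio")
    # Sample each sex separately without replacement, then mix the order.
    chosen = rng.sample(female_idx, n_female) + rng.sample(male_idx, n_male)
    rng.shuffle(chosen)
    return chosen


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_sex(y_true, scores, sex):
    """Compute AUROC separately for each sex present in the test set."""
    result = {}
    for group in sorted(set(sex)):
        idx = [i for i, s in enumerate(sex) if s == group]
        result[group] = auroc([y_true[i] for i in idx], [scores[i] for i in idx])
    return result
```

In such a setup, one would call `sample_by_sex_ratio` for several values of `female_fraction`, train a model on each subset, and compare the output of `auroc_by_sex` on a fixed test set across those ratios.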

Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods

Jun 17, 2024

Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen

Abstract: Machine learning models have achieved high overall accuracy in medical image analysis. However, performance disparities on specific patient groups pose challenges to their clinical utility, safety, and fairness. This can affect known patient groups - such as those based on sex, age, or disease subtype - as well as previously unknown and unlabeled groups. Furthermore, the root cause of such observed performance disparities is often challenging to uncover, hindering mitigation efforts. In this paper, to address these issues, we leverage Slice Discovery Methods (SDMs) to identify interpretable underperforming subsets of data and formulate hypotheses regarding the cause of observed performance disparities. We introduce a novel SDM and apply it in a case study on the classification of pneumothorax and atelectasis from chest X-rays. Our study demonstrates the effectiveness of SDMs in hypothesis formulation and yields an explanation of previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models. Our findings indicate shortcut learning in both classification tasks, through the presence of chest drains and ECG wires, respectively. Sex-based differences in the prevalence of these shortcut features appear to cause the observed classification performance gap, representing a previously underappreciated interaction between shortcut learning and model fairness analyses.

Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

Dec 21, 2023

Nina Weng, Paraskevas Pegios, Aasa Feragen, Eike Petersen, Siavash Bigdeli

Abstract: Shortcut learning is when a model (e.g. a cardiac disease classifier) exploits correlations between the target label and a spurious shortcut feature, e.g. a pacemaker, to predict the target label based on the shortcut rather than real discriminative features. This is common in medical imaging, where treatment and clinical annotations correlate with disease labels, making them easy shortcuts to predict disease. We propose a novel method for detecting and quantifying the impact of potential shortcut features via fast diffusion-based counterfactual image generation that can synthetically remove or add shortcuts. Via a novel inpainting-based modification we spatially limit the changes made with no extra inference step, encouraging the removal of spatially constrained shortcut features while ensuring that the shortcut-free counterfactuals preserve their remaining image features to a high degree. Using these, we assess how shortcut features influence model predictions. This is enabled by our second contribution: an efficient diffusion-based counterfactual explanation method with significant inference speed-up at image quality comparable to the state of the art. We confirm this on two large chest X-ray datasets, a skin lesion dataset, and CelebA.

Are Sex-based Physiological Differences the Cause of Gender Bias for Chest X-ray Diagnosis?

Aug 09, 2023

Nina Weng, Siavash Bigdeli, Eike Petersen, Aasa Feragen

Abstract: While many studies have assessed the fairness of AI algorithms in the medical field, the causes of differences in prediction performance are often unknown. This lack of knowledge about the causes of bias hampers the efficacy of bias mitigation, as evidenced by the fact that simple dataset balancing still often performs best in reducing performance gaps but is unable to resolve all performance differences. In this work, we investigate the causes of gender bias in machine learning-based chest X-ray diagnosis. In particular, we explore the hypothesis that breast tissue leads to underexposure of the lungs and causes lower model performance. Methodologically, we propose a new sampling method which addresses the highly skewed distribution of recordings per patient in two widely used public datasets, while at the same time reducing the impact of label errors. Our comprehensive analysis of gender differences across diseases, datasets, and gender representations in the training set shows that dataset imbalance is not the sole cause of performance differences. Moreover, relative group performance differs strongly between datasets, indicating important dataset-specific factors influencing male/female group performance. Finally, we investigate the effect of breast tissue more specifically, by cropping out the breasts from recordings, finding that this does not resolve the observed performance gaps. In conclusion, our results indicate that dataset-specific factors, not fundamental physiological differences, are the main drivers of male-female performance gaps in chest X-ray analyses on the widely used NIH and CheXpert datasets.

Are demographically invariant models and representations in medical imaging fair?

May 02, 2023

Eike Petersen, Enzo Ferrante, Melanie Ganz, Aasa Feragen

Abstract: Medical imaging models have been shown to encode information about patient demographics (age, race, sex) in their latent representation, raising concerns about their potential for discrimination. Here, we ask whether it is feasible and desirable to train models that do not encode demographic attributes. We consider different types of invariance with respect to demographic attributes - marginal, class-conditional, and counterfactual model invariance - and lay out their equivalence to standard notions of algorithmic fairness. Drawing on existing theory, we find that marginal and class-conditional invariance can be considered overly restrictive approaches for achieving certain fairness notions, resulting in significant predictive performance losses. Concerning counterfactual model invariance, we note that defining medical image counterfactuals with respect to demographic attributes is fraught with complexities. Finally, we posit that demographic encoding may even be considered advantageous if it enables learning a task-specific encoding of demographic features that does not rely on human-constructed categories such as 'race' and 'gender'. We conclude that medical imaging models may need to encode demographic attributes, lending further urgency to calls for comprehensive model fairness assessments in terms of predictive performance.

That Label's Got Style: Handling Label Style Bias for Uncertain Image Segmentation

Mar 28, 2023

Kilian Zepf, Eike Petersen, Jes Frellsen, Aasa Feragen

Abstract: Segmentation uncertainty models predict a distribution over plausible segmentations for a given input, which they learn from the annotator variation in the training set. However, in practice these annotations can differ systematically in the way they are generated, for example through the use of different labeling tools. This results in datasets that contain both data variability and differing label styles. In this paper, we demonstrate that applying state-of-the-art segmentation uncertainty models to such datasets can lead to model bias caused by the different label styles. We present an updated modelling objective conditioning on labeling style for aleatoric uncertainty estimation, and modify two state-of-the-art architectures for segmentation uncertainty accordingly. We show with extensive experiments that this method reduces label style bias, while improving segmentation performance, increasing the applicability of segmentation uncertainty models in the wild. We curate two datasets, with annotations in different label styles, which we will make publicly available along with our code upon publication.

On the fairness of risk score models

Feb 22, 2023

Eike Petersen, Melanie Ganz, Sune Hannibal Holm, Aasa Feragen

Abstract: Recent work on algorithmic fairness has largely focused on the fairness of discrete decisions, or classifications. While such decisions are often based on risk score models, the fairness of the risk models themselves has received considerably less attention. Risk models are of interest for a number of reasons, including the fact that they communicate uncertainty about the potential outcomes to users, thus representing a way to enable meaningful human oversight. Here, we address fairness desiderata for risk score models. We identify the provision of similar epistemic value to different groups as a key desideratum for risk score fairness. Further, we address how to assess the fairness of risk score models quantitatively, including a discussion of metric choices and meaningful statistical comparisons between groups. In this context, we also introduce a novel calibration error metric that is less sample size-biased than previously proposed metrics, enabling meaningful comparisons between groups of different sizes. We illustrate our methodology - which is widely applicable in many other settings - in two case studies, one in recidivism risk prediction, and one in risk of major depressive disorder (MDD) prediction.
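
To make the idea of a per-group calibration comparison concrete, here is a minimal sketch of the standard binned expected calibration error (ECE), computed separately for each group. This is explicitly not the novel metric the abstract introduces, whose definition is not given in this listing; it is the kind of previously proposed, sample-size-sensitive estimate that the abstract contrasts against. Function names and the equal-width binning are assumptions for illustration.

```python
def expected_calibration_error(y_true, risk, n_bins=10):
    """Standard equal-width binned ECE for binary outcomes.

    y_true: list of 0/1 outcomes; risk: predicted probabilities in [0, 1].
    """
    bins = [[] for _ in range(n_bins)]
    for y, r in zip(y_true, risk):
        # A risk of exactly 1.0 is placed in the last bin.
        b = min(int(r * n_bins), n_bins - 1)
        bins[b].append((y, r))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(r for _, r in members) / len(members)
        # Weight each bin's calibration gap by the share of samples in it.
        ece += len(members) / n * abs(observed - predicted)
    return ece


def ece_by_group(y_true, risk, group, n_bins=10):
    """Compute the binned ECE separately for every group label."""
    result = {}
    for g in sorted(set(group)):
        idx = [i for i, label in enumerate(group) if label == g]
        result[g] = expected_calibration_error(
            [y_true[i] for i in idx], [risk[i] for i in idx], n_bins
        )
    return result
```

Because each bin's gap is estimated from finitely many samples, this binned estimate tends to be larger for small groups even when the underlying model is equally well calibrated, which is the comparison problem the abstract refers to.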

Feature robustness and sex differences in medical imaging: a case study in MRI-based Alzheimer's disease detection

Apr 12, 2022

Eike Petersen, Aasa Feragen, Maria Luise da Costa Zemsch, Anders Henriksen, Oskar Eiler Wiese Christensen, Melanie Ganz

Abstract: Convolutional neural networks have enabled significant improvements in medical image-based disease classification. It has, however, become increasingly clear that these models are susceptible to performance degradation due to spurious correlations and dataset shifts, which may lead to underperformance on underrepresented patient groups, among other problems. In this paper, we compare two classification schemes on the ADNI MRI dataset: a very simple logistic regression model that uses manually selected volumetric features as inputs, and a convolutional neural network trained on 3D MRI data. We assess the robustness of the trained models in the face of varying dataset splits, training set sex composition, and stage of disease. In contrast to earlier work on diagnosing lung diseases based on chest X-ray data, we do not find a strong dependence of model performance for male and female test subjects on the sex composition of the training dataset. Moreover, in our analysis, the low-dimensional model with manually selected features outperforms the 3D CNN, thus emphasizing the need for automatic robust feature extraction methods and the value of manual feature specification (based on prior knowledge) for robustness.

* Submitted to MICCAI 2022

Responsible and Regulatory Conform Machine Learning for Medicine: A Survey of Technical Challenges and Solutions

Jul 20, 2021

Eike Petersen, Yannik Potdevin, Esfandiar Mohammadi, Stephan Zidowitz, Sabrina Breyer, Dirk Nowotka, Sandra Henn, Ludwig Pechmann, Martin Leucker, Philipp Rostalski (+1 more)

Abstract: Machine learning is expected to fuel significant improvements in medical care. To ensure that fundamental principles such as beneficence, respect for human autonomy, prevention of harm, justice, privacy, and transparency are respected, medical machine learning applications must be developed responsibly. In this paper, we survey the technical challenges involved in creating medical machine learning systems responsibly and in conformity with existing regulations, as well as possible solutions to address these challenges. We begin by providing a brief overview of existing regulations affecting medical machine learning, showing that properties such as safety, robustness, reliability, privacy, security, transparency, explainability, and nondiscrimination are all demanded already by existing law and regulations - albeit, in many cases, to an uncertain degree. Next, we discuss the underlying technical challenges, possible ways for addressing them, and their respective merits and drawbacks. We notice that distribution shift, spurious correlations, model underspecification, and data scarcity represent severe challenges in the medical context (and others) that are very difficult to solve with classical black-box deep neural networks. Important measures that may help to address these challenges include the use of large and representative datasets and federated learning as a means to that end, the careful exploitation of domain knowledge wherever feasible, the use of inherently transparent models, comprehensive model testing and verification, as well as stakeholder inclusion.

* Preprint submitted to Artificial Intelligence in Medicine

On Approximate Nonlinear Gaussian Message Passing On Factor Graphs

Mar 21, 2019

Eike Petersen, Christian Hoffmann, Philipp Rostalski

Abstract: Factor graphs have recently gained increasing attention as a unified framework for representing and constructing algorithms for signal processing, estimation, and control. One capability that does not seem to be well explored within the factor graph tool kit is the ability to handle deterministic nonlinear transformations, such as those occurring in nonlinear filtering and smoothing problems, using tabulated message passing rules. In this contribution, we provide general forward (filtering) and backward (smoothing) approximate Gaussian message passing rules for deterministic nonlinear transformation nodes in arbitrary factor graphs fulfilling a Markov property, based on numerical quadrature procedures for the forward pass and a Rauch-Tung-Striebel-type approximation of the backward pass. These message passing rules can be employed for deriving many algorithms for solving nonlinear problems using factor graphs, as is illustrated by the proposition of a nonlinear modified Bryson-Frazier (MBF) smoother based on the presented message passing rules.

* 2018 IEEE Statistical Signal Processing Workshop (SSP)
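
As a rough illustration of a quadrature-based forward pass of the kind this abstract mentions, the sketch below pushes a scalar Gaussian message through a deterministic nonlinear function using a fixed 3-point Gauss-Hermite rule. It is not the paper's tabulated rule set: the scalar setting, the choice of a 3-point rule, and the function name are assumptions made for the example.

```python
import math

# 3-point Gauss-Hermite rule for a standard normal variable:
# nodes 0 and +/-sqrt(3), weights 2/3 and 1/6 (exact for polynomials up to degree 5).
_NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
_WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def forward_gaussian_message(f, mean, var):
    """Approximate the Gaussian message for y = f(x) given x ~ N(mean, var).

    Returns the approximate mean and variance of y and the cross-covariance
    cov(x, y), all computed by numerical quadrature.
    """
    std = math.sqrt(var)
    # Map standard-normal quadrature nodes to the input distribution.
    xs = [mean + std * node for node in _NODES]
    ys = [f(x) for x in xs]
    y_mean = sum(w * y for w, y in zip(_WEIGHTS, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(_WEIGHTS, ys))
    cross_cov = sum(w * (x - mean) * (y - y_mean) for w, x, y in zip(_WEIGHTS, xs, ys))
    return y_mean, y_var, cross_cov
```

For a linear `f` the result matches the exact Gaussian propagation. The cross-covariance is returned because smoother gains of the Rauch-Tung-Striebel type are generally built from such a quantity.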
