Abstract:Feature attribution methods have become a staple method to disentangle the complex behavior of black box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cin\`a et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable (e.g., focusing on an object relevant for prediction) and undesirable model behaviors (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis on the metrics to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.
Abstract:Automated segmentation of human cardiac magnetic resonance datasets has been steadily improving during recent years. However, these methods are not directly applicable in preclinical context due to limited datasets and lower image resolution. Successful application of deep architectures for rat cardiac segmentation, although of critical importance for preclinical evaluation of cardiac function, has to our knowledge not yet been reported. We developed segmentation models that expand on the standard U-Net architecture and evaluated separate models for systole and diastole phases, 2MSA, and one model for all timepoints, 1MSA. Furthermore, we calibrated model outputs using a Gaussian Process (GP)-based prior to improve phase selection. Resulting models approach human performance in terms of left ventricular segmentation quality and ejection fraction (EF) estimation in both 1MSA and 2MSA settings (S{\o}rensen-Dice score 0.91 +/- 0.072 and 0.93 +/- 0.032, respectively). 2MSA achieved a mean absolute difference between estimated and reference EF of 3.5 +/- 2.5 %, while 1MSA resulted in 4.1 +/- 3.0 %. Applying Gaussian Processes to 1MSA allows to automate the selection of systole and diastole phases. Combined with a novel cardiac phase selection strategy, our work presents an important first step towards a fully automated segmentation pipeline in the context of rat cardiac analysis.