Abstract: How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often suffer from problems such as in-distribution and small-sized test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark covering 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we evaluated pre-existing AI frameworks--which, unlike algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
Abstract: As medical datasets rapidly expand, creating detailed annotations of different body structures becomes increasingly expensive and time-consuming. We argue that asking radiologists to create detailed annotations from scratch is unnecessarily burdensome and that pre-existing AI models can largely automate this process. Following the spirit of "don't use a sledgehammer to crack a nut," we find that, rather than creating annotations from scratch, radiologists only have to review the Best-AI Labels and edit them when they contain mistakes. To obtain the Best-AI Labels among multiple AI Labels, we developed an automatic tool, called Label Critic, that can assess label quality through tireless pairwise comparisons. Extensive experiments demonstrate that, when incorporated with our developed Image-Prompt pairs, pre-existing Large Vision-Language Models (LVLMs), trained on natural images and texts, achieve 96.5% accuracy when choosing the best label in a pairwise comparison, without extra fine-tuning. By transforming the manual annotation task (30-60 min/scan) into an automatic comparison task (15 sec/scan), we effectively reduce the manual effort required from radiologists by an order of magnitude. When the Best-AI Labels are sufficiently accurate (81%, depending on the body structure), they will be directly adopted as the gold-standard annotations for the dataset, with lower-quality AI Labels automatically discarded. Label Critic can also check the quality of a single AI Label with 71.8% accuracy when no alternatives are available for comparison, prompting radiologists to review and edit it if the estimated quality is low (19%, depending on the body structure).
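A minimal sketch of the pairwise-selection idea described above, not the Label Critic implementation: compare_pair is a hypothetical stand-in for the LVLM judge that receives an Image-Prompt pair showing two candidate labels and returns the preferred one.

from typing import Callable, List

def select_best_label(candidates: List[str],
                      compare_pair: Callable[[str, str], str]) -> str:
    """Single-elimination pass: keep the winner of each pairwise comparison."""
    best = candidates[0]
    for challenger in candidates[1:]:
        best = compare_pair(best, challenger)  # hypothetical LVLM judge returns the preferred label
    return best

# Toy usage: a fake judge that simply prefers the alphabetically smaller file name.
labels = ["nnunet_liver.nii.gz", "monai_liver.nii.gz", "totalseg_liver.nii.gz"]
print("Best-AI Label:", select_best_label(labels, compare_pair=lambda a, b: min(a, b)))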
Abstract: We introduce the largest abdominal CT dataset (termed AbdomenAtlas), comprising 20,460 three-dimensional CT volumes sourced from 112 hospitals across diverse populations, geographies, and facilities. AbdomenAtlas provides 673K high-quality masks of anatomical structures in the abdominal region, annotated by a team of 10 radiologists with the help of AI algorithms. We start by having expert radiologists manually annotate 22 anatomical structures in 5,246 CT volumes. Following this, a semi-automatic annotation procedure is performed for the remaining CT volumes: radiologists revise the annotations predicted by AI, and in turn, AI improves its predictions by learning from the revised annotations. Such a large-scale, thoroughly annotated, multi-center dataset is needed for two reasons. Firstly, AbdomenAtlas provides important resources for AI development at scale, branded as large pre-trained models, which can alleviate the annotation workload of expert radiologists and transfer to broader clinical applications. Secondly, AbdomenAtlas establishes a large-scale benchmark for evaluating AI algorithms -- the more data we use to test the algorithms, the better we can guarantee reliable performance in complex clinical scenarios. An ISBI & MICCAI challenge named BodyMaps: Towards 3D Atlas of Human Body was launched using a subset of our AbdomenAtlas, aiming to stimulate AI innovation and to benchmark segmentation accuracy, inference efficiency, and domain generalizability. We hope our AbdomenAtlas can set the stage for larger-scale clinical trials and offer exceptional opportunities to practitioners in the medical imaging community. Code, models, and datasets are available at https://www.zongweiz.com/dataset
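A generic sketch of the Dice overlap commonly used to benchmark segmentation accuracy, shown here with toy masks; this is an illustration, not the official BodyMaps evaluation code.

import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|A intersect B| / (|A| + |B|) for binary 3D masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps))

# Toy usage with random 3D masks standing in for one organ label of a CT volume.
rng = np.random.default_rng(0)
ai_mask = rng.random((64, 64, 64)) > 0.5
revised_mask = rng.random((64, 64, 64)) > 0.5
print(f"Dice: {dice_score(ai_mask, revised_mask):.3f}")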
Abstract: Bias and spurious correlations in data can cause shortcut learning, undermining out-of-distribution (OOD) generalization in deep neural networks. Most methods require unbiased data during training (and/or hyper-parameter tuning) to counteract shortcut learning. Here, we propose the use of explanation distillation to hinder shortcut learning. The technique does not assume any access to unbiased data, and it allows an arbitrarily sized student network to learn the reasons behind the decisions of an unbiased teacher, such as a vision-language model or a network processing debiased images. We found that it is possible to train a neural network with explanation distillation only (e.g., by distilling Layer-wise Relevance Propagation, LRP, heatmaps), and that the technique leads to high resistance to shortcut learning, surpassing group-invariant learning, explanation background minimization, and alternative distillation techniques. On the COLOURED MNIST dataset, LRP distillation achieved 98.2% OOD accuracy, while deep feature distillation and IRM achieved 92.1% and 60.2%, respectively. On COCO-on-Places, the undesirable generalization gap between in-distribution and OOD accuracy is only 4.4% for LRP distillation, while the other two techniques present gaps of 15.1% and 52.1%, respectively.
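A minimal sketch of the explanation-distillation idea, under simplifying assumptions: input-times-gradient saliency stands in for LRP, and tiny fully connected teacher/student networks stand in for real models. The student is trained only to reproduce the teacher's heatmaps, never its labels or logits.

import torch
import torch.nn as nn

def saliency(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Input-times-gradient heatmap of the predicted class (a simplified stand-in for LRP)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    score = logits.gather(1, logits.argmax(dim=1, keepdim=True)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad * x

teacher = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))  # frozen, "unbiased" teacher
student = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 2))  # arbitrarily sized student
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(8, 64)                               # toy batch standing in for images
teacher_maps = saliency(teacher, x).detach()         # teacher explanations, no gradient
student_maps = saliency(student, x)                  # differentiable w.r.t. student weights
loss = nn.functional.mse_loss(student_maps, teacher_maps)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"explanation-distillation loss: {loss.item():.4f}")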
Abstract: Image background features can constitute background bias (spurious correlations) and impact deep classifiers' decisions, causing shortcut learning (the Clever Hans effect) and reducing generalization on real-world data. The concept of optimizing Layer-wise Relevance Propagation (LRP) heatmaps to improve classifier behavior was recently introduced by a neural network architecture named ISNet. It minimizes background relevance in LRP maps to mitigate the influence of image background features on deep classifiers' decisions, hindering shortcut learning and improving generalization. For each training image, the original ISNet produces one heatmap per possible class in the classification task; hence, its training time scales linearly with the number of classes. Here, we introduce reformulated architectures whose training time is independent of this number, rendering the optimization process much faster. We challenged the enhanced models on the MNIST dataset with synthetic background bias and on COVID-19 detection in chest X-rays, an application that is prone to shortcut learning due to background bias. The trained models minimized background attention and hindered shortcut learning, while retaining high accuracy. On external (out-of-distribution) test datasets, they consistently proved more accurate than multiple state-of-the-art deep neural network architectures, including a dedicated image semantic segmenter followed by a classifier. The architectures presented here represent a potentially massive improvement in training speed over the original ISNet, thus introducing LRP optimization into a gamut of applications that could not be feasibly handled by the original model.
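A minimal sketch of the background-relevance penalty concept, not the ISNet code: given a precomputed heatmap and a foreground mask, it measures the share of absolute relevance falling on the background, which the training loss would drive toward zero.

import torch

def background_relevance_loss(heatmap: torch.Tensor,
                              foreground_mask: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """heatmap: (B, 1, H, W) relevance map; foreground_mask: (B, 1, H, W) with values in {0, 1}."""
    abs_rel = heatmap.abs()
    background = abs_rel * (1.0 - foreground_mask)
    return background.sum(dim=(1, 2, 3)) / (abs_rel.sum(dim=(1, 2, 3)) + eps)

# Toy usage: random heatmaps and a centered square standing in for a foreground (e.g., lung) mask.
heatmaps = torch.randn(4, 1, 28, 28)
mask = torch.zeros(4, 1, 28, 28)
mask[..., 7:21, 7:21] = 1.0
print(background_relevance_loss(heatmaps, mask))   # one background-relevance ratio per image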
Abstract: In this work we propose a novel deep neural network (DNN) architecture, the ISNet, to solve the task of image segmentation followed by classification, replacing the common two-network pipeline with a single model. We designed the ISNet for high flexibility and performance: it allows virtually any classification neural network architecture to analyze a common image as if it had been previously segmented. Furthermore, relative to the original classifier, the ISNet causes no increase in computational cost and no architectural changes at run-time. To accomplish this, we introduce the concept of optimizing DNNs for relevance segmentation in heatmaps created by Layer-wise Relevance Propagation (LRP), which proves to be equivalent to the classification of previously segmented images. We apply an ISNet based on a DenseNet121 classifier to the task of COVID-19 detection in chest X-rays. We compare the model to a U-Net (performing lung segmentation) followed by a DenseNet121, and to a standalone DenseNet121. Due to the implicit segmentation, the ISNet precisely ignored the X-ray regions outside of the lungs; it achieved 94.5 +/- 4.1% mean accuracy on an external database, showing strong generalization capability and surpassing the other models' performance by 6 to 7.9%. The ISNet presents a fast and lightweight methodology to perform classification preceded by segmentation, while also being more accurate than standard pipelines.
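A minimal sketch of the explicit pipeline that LRP optimization is shown to be equivalent to: zeroing out everything outside a lung mask (here a hand-made placeholder rather than a U-Net output) before feeding the image to a DenseNet121 classifier. The ISNet achieves the same effect implicitly, without needing the mask at run-time.

import torch
from torchvision.models import densenet121

classifier = densenet121(weights=None, num_classes=3)   # e.g., COVID-19 / pneumonia / normal
xray = torch.randn(1, 3, 224, 224)                      # toy tensor standing in for a chest X-ray
lung_mask = torch.zeros(1, 1, 224, 224)
lung_mask[..., 40:190, 30:200] = 1.0                    # placeholder for a U-Net lung mask

segmented = xray * lung_mask          # background features are removed explicitly
logits = classifier(segmented)        # the ISNet reaches this behavior implicitly, via LRP optimization
print(logits.shape)                   # torch.Size([1, 3])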
Abstract: Objective: To propose a novel deep neural network (DNN) architecture -- the filter bank convolutional neural network (FBCNN) -- to improve SSVEP classification in single-channel BCIs with small data lengths. Methods: We propose two models: the FBCNN-2D and the FBCNN-3D. The FBCNN-2D utilizes a filter bank to create sub-band components of the electroencephalography (EEG) signal, which it transforms using the fast Fourier transform (FFT) and analyzes with a 2D CNN. The FBCNN-3D utilizes the same filter bank, but it transforms the sub-band components into spectrograms via the short-time Fourier transform (STFT) and analyzes them with a 3D CNN. We made use of transfer learning. To train the FBCNN-3D, we proposed a new technique, called inter-dimensional transfer learning, to transfer knowledge from a 2D DNN to a 3D DNN. Our BCI was conceived so as not to require calibration from the final user; therefore, the test subject's data was separated from training and validation. Results: The mean test accuracy was 85.7% for the FBCNN-2D and 85% for the FBCNN-3D, with mean F1-Scores of 0.858 and 0.853, respectively. Alternative classification methods, SVM, FBCCA, and a CNN, had mean accuracies of 79.2%, 80.1%, and 81.4%, respectively. Conclusion: The FBCNNs surpassed traditional SSVEP classification methods in our simulated BCI by a considerable margin (about 5% higher accuracy). Transfer learning and inter-dimensional transfer learning made training much faster and more predictable. Significance: We proposed a new and flexible type of DNN, which performed better than standard methods in SSVEP classification for portable and fast BCIs.
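A minimal sketch of the FBCNN-2D front end, with assumed parameters (sampling rate, sub-band edges, and filter order are illustrative, not the paper's values): band-pass the single-channel EEG into sub-bands and take the FFT magnitude of each, yielding a 2D array for the CNN.

import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 250                                   # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1 / fs)              # 2-second data length
eeg = np.sin(2 * np.pi * 12 * t) + 0.5 * np.random.randn(t.size)   # toy SSVEP response at 12 Hz

bands = [(6, 14), (14, 22), (22, 30)]      # assumed sub-band edges (Hz)
features = []
for low, high in bands:
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    sub_band = sosfiltfilt(sos, eeg)       # one filter-bank sub-band component
    features.append(np.abs(np.fft.rfft(sub_band)))   # FFT magnitude of the sub-band

fb_input = np.stack(features)              # shape (n_bands, n_freq_bins), fed to a 2D CNN
print(fb_input.shape)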
Abstract: We evaluated the generalization capability of deep neural networks (DNNs) trained to classify chest X-rays as COVID-19, normal, or pneumonia, using a relatively small and mixed dataset. We proposed a DNN architecture that performs lung segmentation and classification, stacking a segmentation module (U-Net), an original intermediate module, and a classification module (DenseNet201); we compared it to a standalone DenseNet201. To evaluate generalization, we tested the DNNs on an external dataset (from distinct localities) and used Bayesian inference to estimate the probability distributions of performance metrics, such as the F1-Score. Our proposed DNN achieved 0.917 AUC on the external test dataset, and the DenseNet achieved 0.906. Bayesian inference indicated a mean accuracy of 76.1% with a [0.695, 0.826] 95% HDI with segmentation and, without segmentation, 71.7% with a [0.646, 0.786] HDI. We also proposed a novel DNN evaluation technique, using Layer-wise Relevance Propagation (LRP) and the Brixia score. LRP heatmaps indicated that the areas where radiologists found strong COVID-19 symptoms and attributed high Brixia scores are the most important for the stacked DNN's classification. External validation showed lower accuracies than internal validation, indicating dataset bias, which segmentation reduces. Performance on the external dataset and the LRP analysis suggest that DNNs can be trained on small and mixed datasets and still detect COVID-19.
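A minimal sketch of the Bayesian evaluation idea with hypothetical counts: a Beta posterior over accuracy, from which the posterior mean and a sampled 95% highest-density interval (HDI) are obtained.

import numpy as np
from scipy import stats

correct, total = 180, 236                 # hypothetical external-test counts
posterior = stats.beta(1 + correct, 1 + total - correct)   # Beta posterior under a uniform prior

samples = np.sort(posterior.rvs(size=100_000, random_state=0))
width = int(0.95 * samples.size)
starts = samples[: samples.size - width]
ends = samples[width:]
i = np.argmin(ends - starts)              # narrowest interval containing 95% of the posterior mass
print(f"mean accuracy: {posterior.mean():.3f}")
print(f"95% HDI: [{starts[i]:.3f}, {ends[i]:.3f}]")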
Abstract: In this work, we used a deep convolutional neural network (DCNN) to classify electroencephalography (EEG) signals in a steady-state visually evoked potentials (SSVEP) based brain-computer interface (BCI). The raw EEG signals were converted to spectrograms and served as input to train a DCNN using the transfer learning technique. We applied a second technique, data augmentation (mostly SpecAugment), which is generally employed in speech recognition. When the evaluated user's data was excluded from the fine-tuning process, the results reached 99.3% mean test accuracy and a 0.992 mean F1 score on 35 subjects from an open dataset.
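A minimal sketch of the input pipeline, with assumed signal length and mask sizes: a raw EEG trial is converted to a spectrogram and SpecAugment-style frequency and time masks are applied before the DCNN sees it.

import numpy as np
from scipy.signal import spectrogram

fs = 250                                              # assumed sampling rate (Hz)
eeg = np.random.randn(3 * fs)                         # toy 3-second single-channel trial
freqs, times, spec = spectrogram(eeg, fs=fs, nperseg=64, noverlap=48)

rng = np.random.default_rng(0)
aug = spec.copy()
f0 = rng.integers(0, spec.shape[0] - 5)
t0 = rng.integers(0, spec.shape[1] - 5)
aug[f0:f0 + 5, :] = 0.0                               # SpecAugment-style frequency mask
aug[:, t0:t0 + 5] = 0.0                               # SpecAugment-style time mask
print(spec.shape, aug.shape)                          # spectrograms fed to the DCNN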
Abstract: We present an image classifier based on CheXNet and a transfer learning stage to classify chest X-ray images according to three labels: COVID-19, viral pneumonia, and normal. CheXNet is a DenseNet121 that has been trained twice: firstly on ImageNet and then, for the classification of pneumonia and 13 other chest diseases, on a large chest X-ray database (ChestX-ray14). The proposed network reached a test accuracy of 97.8% overall and of 98.3% for the COVID-19 class. In order to clarify the modus operandi of the network, we used Layer-wise Relevance Propagation (LRP) to generate heat maps, indicating an analytical path for future research on diagnosis.
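A minimal sketch of the transfer-learning stage, using torchvision's ImageNet-pretrained DenseNet121 as a stand-in for CheXNet weights (which would be loaded separately); freezing the feature extractor is an illustrative choice, not necessarily the paper's.

import torch.nn as nn
from torchvision.models import densenet121, DenseNet121_Weights

model = densenet121(weights=DenseNet121_Weights.IMAGENET1K_V1)   # pre-trained backbone
for param in model.features.parameters():
    param.requires_grad = False                     # keep the pre-trained features frozen (illustrative)

num_features = model.classifier.in_features
model.classifier = nn.Linear(num_features, 3)       # new head: COVID-19, viral pneumonia, normal
print(model.classifier)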