Abstract:Vertebral detection and segmentation are critical steps for treatment planning in spine surgery and radiation therapy. Accurate identification and segmentation are complicated in imaging that does not include the full spine, in cases with variations in anatomy (T13 and/or L6 vertebrae), and in the presence of fracture or hardware. This paper proposes VertDetect, a fully automated end-to-end 3D vertebral instance segmentation Convolutional Neural Network (CNN) model to predict vertebral level labels and segmentations for all vertebrae present in a CT scan. The utilization of a shared CNN backbone provides the detection and segmentation branches of the network with feature maps containing both spinal and vertebral level information. A Graph Convolutional Network (GCN) layer is used to improve vertebral labelling by using the known structure of the spine. This model achieved a Dice Similarity Coefficient (DSC) of 0.883 (95% CI, 0.843-0.906) and 0.882 (95% CI, 0.835-0.909) in the VerSe 2019 and 0.868 (95\% CI, 0.834-0.890) and 0.869 (95\% CI, 0.832-0.891) in the VerSe 2020 public and hidden test sets, respectively. This model achieved state-of-the-art performance for an end-to-end architecture, whose design facilitates the extraction of features that can be subsequently used for downstream tasks.
Abstract:Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
Abstract:The field of automatic biomedical image analysis crucially depends on robust and meaningful performance metrics for algorithm validation. Current metric usage, however, is often ill-informed and does not reflect the underlying domain interest. Here, we present a comprehensive framework that guides researchers towards choosing performance metrics in a problem-aware manner. Specifically, we focus on biomedical image analysis problems that can be interpreted as a classification task at image, object or pixel level. The framework first compiles domain interest-, target structure-, data set- and algorithm output-related properties of a given problem into a problem fingerprint, while also mapping it to the appropriate problem category, namely image-level classification, semantic segmentation, instance segmentation, or object detection. It then guides users through the process of selecting and applying a set of appropriate validation metrics while making them aware of potential pitfalls related to individual choices. In this paper, we describe the current status of the Metrics Reloaded recommendation framework, with the goal of obtaining constructive feedback from the image analysis community. The current version has been developed within an international consortium of more than 60 image analysis experts and will be made openly available as a user-friendly toolkit after community-driven optimization.
Abstract:Deep artificial neural networks (DNNs) have moved to the forefront of medical image analysis due to their success in classification, segmentation, and detection challenges. A principal challenge in large-scale deployment of DNNs in neuroimage analysis is the potential for shifts in signal-to-noise ratio, contrast, resolution, and presence of artifacts from site to site due to variances in scanners and acquisition protocols. DNNs are famously susceptible to these distribution shifts in computer vision. Currently, there are no benchmarking platforms or frameworks to assess the robustness of new and existing models to specific distribution shifts in MRI, and accessible multi-site benchmarking datasets are still scarce or task-specific. To address these limitations, we propose ROOD-MRI: a platform for benchmarking the Robustness of DNNs to Out-Of-Distribution (OOD) data, corruptions, and artifacts in MRI. The platform provides modules for generating benchmarking datasets using transforms that model distribution shifts in MRI, implementations of newly derived benchmarking metrics for image segmentation, and examples for using the methodology with new models and tasks. We apply our methodology to hippocampus, ventricle, and white matter hyperintensity segmentation in several large studies, providing the hippocampus dataset as a publicly available benchmark. By evaluating modern DNNs on these datasets, we demonstrate that they are highly susceptible to distribution shifts and corruptions in MRI. We show that while data augmentation strategies can substantially improve robustness to OOD data for anatomical segmentation tasks, modern DNNs using augmentation still lack robustness in more challenging lesion-based segmentation tasks. We finally benchmark U-Nets and transformer-based models, finding consistent differences in robustness to particular classes of transforms across architectures.
Abstract:Cancer stage is a large determinant of patient prognosis and management in many cancer types, and is often assessed using medical imaging modalities, such as CT and MRI. These medical images contain rich information that can be explored to stratify patients within each stage group to further improve prognostic algorithms. Although the majority of cancer deaths result from metastatic and multifocal disease, building imaging biomarkers for patients with multiple tumors has been a challenging task due to the lack of annotated datasets and standard study framework. In this paper, we process two public datasets to set up a benchmark cohort of 341 patient in total for studying outcome prediction of multifocal metastatic cancer. We identify the lack of expressiveness in common multiple instance classification networks and propose two injective multiple instance pooling functions that are better suited to outcome prediction. Our results show that multiple instance learning with injective pooling functions can achieve state-of-the-art performance in the non-small-cell lung cancer CT and head and neck CT outcome prediction benchmarking tasks. We will release the processed multifocal datasets, our code and the intermediate files i.e. extracted radiomic features to support further transparent and reproducible research.
Abstract:Radiology reports are the main form of communication between radiologists and other clinicians, and contain important information for patient care. However in order to use this information for research it is necessary to convert the raw text into structured data suitable for analysis. Domain specific contextual word embeddings have been shown to achieve impressive accuracy at such natural language processing tasks in medicine. In this work we pre-trained a contextual embedding BERT model using breast radiology reports and developed a classifier that incorporated the embedding with auxiliary global textual features in order to perform a section tokenization task. This model achieved a 98% accuracy at segregating free text reports into sections of information outlined in the Breast Imaging Reporting and Data System (BI-RADS) lexicon, a significant improvement over the Classic BERT model without auxiliary information. We then evaluated whether using section tokenization improved the downstream extraction of the following fields: modality/procedure, previous cancer, menopausal status, purpose of exam, breast density and background parenchymal enhancement. Using the BERT model pre-trained on breast radiology reports combined with section tokenization resulted in an overall accuracy of 95.9% in field extraction. This is a 17% improvement compared to an overall accuracy of 78.9% for field extraction for models without section tokenization and with Classic BERT embeddings. Our work shows the strength of using BERT in radiology report analysis and the advantages of section tokenization in identifying key features of patient factors recorded in breast radiology reports.
Abstract:We investigate the utility of pretraining by contrastive self supervised learning on both natural-scene and medical imaging datasets when the unlabeled dataset size is small, or when the diversity within the unlabeled set does not lead to better representations. We use a two step approach which is analogous to supervised training with ImageNet initialization, where we pretrain networks that are already pretrained on ImageNet dataset to improve downstream task performance on the domain of interest. To improve the speed of convergence and the overall performance, we propose weight scaling and filter selection methods prior to second step of pretraining. We demonstrate the utility of this approach on three popular contrastive techniques, namely SimCLR, SWaV and BYOL. Benefits of double pretraining include better performance, faster convergence, ability to train with smaller batch sizes and smaller image dimensions with negligible differences in performance. We hope our work helps democratize self-supervision by enabling researchers to fine-tune models without requiring large clusters or long training times.
Abstract:Training a neural network with a large labeled dataset is still a dominant paradigm in computational histopathology. However, obtaining such exhaustive manual annotations is often expensive, laborious, and prone to inter and Intra-observer variability. While recent self-supervised and semi-supervised methods can alleviate this need by learn-ing unsupervised feature representations, they still struggle to generalize well to downstream tasks when the number of labeled instances is small. In this work, we overcome this challenge by leveraging both task-agnostic and task-specific unlabeled data based on two novel strategies: i) a self-supervised pretext task that harnesses the underlying multi-resolution contextual cues in histology whole-slide images to learn a powerful supervisory signal for unsupervised representation learning; ii) a new teacher-student semi-supervised consistency paradigm that learns to effectively transfer the pretrained representations to downstream tasks based on prediction consistency with the task-specific un-labeled data. We carry out extensive validation experiments on three histopathology benchmark datasets across two classification and one regression-based tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. Under limited-label data, the proposed method yields tangible improvements, which is close or even outperforming other state-of-the-art self-supervised and supervised baselines. Furthermore, we empirically show that the idea of bootstrapping the self-supervised pretrained features is an effective way to improve the task-specific semi-supervised learning on standard benchmarks. Code and pretrained models will be made available at: https://github.com/srinidhiPY/SSL_CR_Histo
Abstract:Colorectal cancer is one of the most common and lethal cancers and colorectal cancer liver metastases (CRLM) is the major cause of death in patients with colorectal cancer. Multifocality occurs frequently in CRLM, but is relatively unexplored in CRLM outcome prediction. Most existing clinical and imaging biomarkers do not take the imaging features of all multifocal lesions into account. In this paper, we present an end-to-end autoencoder-based multiple instance neural network (AMINN) for the prediction of survival outcomes in multifocal CRLM patients using radiomic features extracted from contrast-enhanced MRIs. Specifically, we jointly train an autoencoder to reconstruct input features and a multiple instance network to make predictions by aggregating information from all tumour lesions of a patient. In addition, we incorporate a two-step normalization technique to improve the training of deep neural networks, built on the observation that the distributions of radiomic features are almost always severely skewed. Experimental results empirically validated our hypothesis that incorporating imaging features of all lesions improves outcome prediction for multifocal cancer. The proposed ADMINN framework achieved an area under the ROC curve (AUC) of 0.70, which is 19.5% higher than baseline methods. We built a risk score based on the outputs of our network and compared it to other clinical and imaging biomarkers. Our risk score is the only one that achieved statistical significance in univariate and multivariate cox proportional hazard modeling in our cohort of multifocal CRLM patients. The effectiveness of incorporating all lesions and applying two-step normalization is demonstrated by a series of ablation studies. Our code will be released after the peer-review process.
Abstract:Whole slide images (WSIs) pose unique challenges when training deep learning models. They are very large which makes it necessary to break each image down into smaller patches for analysis, image features have to be extracted at multiple scales in order to capture both detail and context, and extreme class imbalances may exist. Significant progress has been made in the analysis of these images, thanks largely due to the availability of public annotated datasets. We postulate, however, that even if a method scores well on a challenge task, this success may not translate to good performance in a more clinically relevant workflow. Many datasets consist of image patches which may suffer from data curation bias; other datasets are only labelled at the whole slide level and the lack of annotations across an image may mask erroneous local predictions so long as the final decision is correct. In this paper, we outline the differences between patch or slide-level classification versus methods that need to localize or segment cancer accurately across the whole slide, and we experimentally verify that best practices differ in both cases. We apply a binary cancer detection network on post neoadjuvant therapy breast cancer WSIs to find the tumor bed outlining the extent of cancer, a task which requires sensitivity and precision across the whole slide. We extensively study multiple design choices and their effects on the outcome, including architectures and augmentations. Furthermore, we propose a negative data sampling strategy, which drastically reduces the false positive rate (7% on slide level) and improves each metric pertinent to our problem, with a 15% reduction in the error of tumor extent.