Abstract:The classification of gigapixel-sized whole slide images (WSIs), digital representations of histological slides obtained via a high-resolution scanner, faces significant challenges associated with the meticulous and time-consuming nature of fine-grained labeling. While weakly-supervised multiple instance learning (MIL) has emerged as a promising approach, current MIL methods are constrained by their limited ability to leverage the wealth of information embedded within unlabeled WSIs. This limitation often necessitates training MIL feature aggregators from scratch after the feature extraction process, hindering efficiency and accuracy. PreMix extends the general MIL framework by pre-training the MIL aggregator with an intra-batch slide mixing approach. Specifically, PreMix incorporates Barlow Twins Slide Mixing during pre-training, enhancing its ability to handle diverse WSI sizes and maximizing the utility of unlabeled WSIs. Combined with Mixup and Manifold Mixup during fine-tuning, PreMix achieves a mean of 4.7% performance improvement over the baseline MIL framework, the hierarchical image pyramid transformer (HIPT), on the Camelyon16 dataset. The observed improvement across a range of active learning acquisition functions and WSI-labeled training budgets highlights the framework's adaptability to diverse datasets and varying resource constraints. Ultimately, PreMix paves the way for more efficient and accurate WSI classification under limited WSI-labeled datasets, encouraging the broader adoption of unlabeled WSI data in histopathological research. The code is available at https://anonymous.4open.science/r/PreMix
Abstract:Multiple instance learning (MIL) has become a preferred method for classifying gigapixel whole slide images (WSIs), without requiring patch label annotation. The focus of the current MIL research stream is on the embedding-based MIL approach, which involves extracting feature vectors from patches using a pre-trained feature extractor. These feature vectors are then fed into an MIL aggregator for slide-level prediction. Despite prior research suggestions on enhancing the most commonly used ResNet50 supervised model pre-trained on ImageNet-1K, there remains a lack of clear guidance on selecting the optimal feature extractor to maximize WSI performance. This study aims at addressing this gap by examining MIL feature extractors across three dimensions: pre-training dataset, backbone model, and pre-training method. Extensive experiments were carried out on the two public WSI datasets (TCGA-NSCLC and Camelyon16) using four SOTA MIL models. The main findings indicate the following: 1) Performance significantly improves with larger and more varied pre-training datasets in both CNN and Transformer backbones. 2) `Modern and deeper' backbones greatly outperform `standard' backbones (ResNet and ViT), with performance improvements more guaranteed in Transformer-based backbones. 3) The choice of self-supervised learning (SSL) method is crucial, with the most significant benefits observed when applied to the Transformer (ViT) backbone. The study findings have practical implications, including designing more effective pathological foundation models. Our code is available at: https://anonymous.4open.science/r/MIL-Feature-Extractor-Selection