Abstract: Endometrial cancer, the sixth most common cancer in females worldwide, presents as a heterogeneous group of tumors, with certain subtypes prone to recurrence. Precise histologic evaluation of endometrial cancer is essential for effective patient management and for selecting the best treatment modalities. This study introduces EndoNet, a transformer-based deep learning approach for histologic classification of endometrial cancer. EndoNet uses convolutional neural networks to extract histologic features and a vision transformer to aggregate these features and classify slides based on their visual characteristics. The model was trained on 929 digitized hematoxylin and eosin-stained whole slide images of endometrial cancer from hysterectomy cases at Dartmouth Health. It classifies slides into low-grade (endometrioid carcinoma FIGO grades 1 and 2) and high-grade (endometrioid carcinoma FIGO grade 3, uterine serous carcinoma, and carcinosarcoma) categories. EndoNet was evaluated on an internal test set of 218 slides and an external test set of 100 randomly selected slides from the public TCGA database. On the internal test set, the model achieved a weighted average F1-score of 0.92 (95% CI: 0.87-0.95) and an AUC of 0.93 (95% CI: 0.88-0.96); on the external test set, it achieved an F1-score of 0.86 (95% CI: 0.80-0.94) and an AUC of 0.86 (95% CI: 0.75-0.93). Pending further validation, EndoNet has the potential to assist pathologists in classifying challenging gynecologic tumors and to enhance patient care.
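The abstract describes a two-stage design: a CNN encodes individual tissue patches, and a vision transformer aggregates the patch features into a slide-level prediction. The sketch below illustrates that general pattern only; the backbone choice (ResNet-18 here), transformer depth, patch size, and class-token pooling are assumptions for illustration, not the published EndoNet configuration.

```python
# Minimal sketch of a CNN + vision-transformer slide classifier.
# All architectural details here are hypothetical, not EndoNet's exact setup.
import torch
import torch.nn as nn
import torchvision.models as models


class SlideClassifier(nn.Module):
    """Extracts per-patch features with a CNN, aggregates them with a
    transformer encoder, and predicts a slide-level label
    (e.g., low-grade vs. high-grade)."""

    def __init__(self, embed_dim=512, num_heads=8, num_layers=4, num_classes=2):
        super().__init__()
        # CNN backbone for patch feature extraction (ResNet-18 for illustration).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                 # keep the 512-d pooled features
        self.feature_extractor = backbone
        # Learnable class token pools patch features for the slide-level label.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patches):
        # patches: (batch, n_patches, 3, H, W) tiles cropped from one WSI
        b, n, c, h, w = patches.shape
        feats = self.feature_extractor(patches.view(b * n, c, h, w))
        feats = feats.view(b, n, -1)                # (batch, n_patches, 512)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, feats], dim=1)
        encoded = self.transformer(tokens)
        return self.head(encoded[:, 0])             # slide-level logits


# Example: 16 patches of 224x224 pixels from a single slide.
logits = SlideClassifier()(torch.randn(1, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 2])
```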
Abstract: In digital pathology, whole slide images (WSIs) are widely used for applications such as cancer diagnosis and prognosis prediction. Vision transformer models have recently emerged as a promising method for encoding large regions of WSIs while preserving spatial relationships among patches. However, due to the large number of model parameters and limited labeled data, applying transformer models to WSIs remains challenging. To address this problem, inspired by masked language modeling, we propose a pretext task for training the transformer model without labeled data. Our model, MaskHIT, uses the transformer output to reconstruct masked patches and learns representative histological features based on their positions and visual features. The experimental results demonstrate that MaskHIT surpasses various multiple instance learning approaches by 3% and 2% on survival prediction and cancer subtype classification tasks, respectively. Furthermore, MaskHIT outperforms two of the most recent state-of-the-art transformer-based methods. Finally, a comparison between the attention maps generated by the MaskHIT model and pathologists' annotations indicates that the model can accurately identify clinically relevant histological structures in each task.
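The abstract states only that masked patches are reconstructed from the transformer output using positional and visual information. The sketch below is a minimal, hypothetical version of such a masked-patch pretext task; the masking ratio, learnable positional embeddings, and mean-squared-error reconstruction loss are assumptions and may differ from MaskHIT's actual design.

```python
# Minimal sketch of a masked-patch pretext task in the spirit of MaskHIT.
# Loss, masking ratio, and positional-encoding scheme are assumptions.
import torch
import torch.nn as nn


class MaskedPatchPretrainer(nn.Module):
    """Masks a fraction of patch features, encodes the sequence with a
    transformer, and reconstructs the masked features from the output."""

    def __init__(self, feat_dim=512, embed_dim=512, num_heads=8,
                 num_layers=4, max_patches=1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)
        # Learnable token that replaces masked patch embeddings.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional embeddings preserve the spatial order of patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.reconstruct = nn.Linear(embed_dim, feat_dim)

    def forward(self, patch_feats, mask_ratio=0.5):
        # patch_feats: (batch, n_patches, feat_dim) precomputed patch features
        b, n, _ = patch_feats.shape
        tokens = self.proj(patch_feats)
        # Randomly select patches to mask and swap in the mask token.
        mask = torch.rand(b, n, device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(b, n, -1), tokens)
        tokens = tokens + self.pos_embed[:, :n]
        encoded = self.encoder(tokens)
        recon = self.reconstruct(encoded)
        # Reconstruction loss is computed only at masked positions.
        return ((recon - patch_feats) ** 2)[mask].mean()


# Example: two slides, each represented by 256 patch features of dimension 512.
loss = MaskedPatchPretrainer()(torch.randn(2, 256, 512))
loss.backward()
```

After pretraining with this self-supervised objective, the encoder can be fine-tuned on labeled slides for downstream tasks such as survival prediction or subtype classification, which is the usage pattern the abstract's comparisons imply.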