Abstract: Publicly available data is essential for the progress of medical image analysis, in particular for crafting machine learning models. Glioma is the most common group of primary brain tumors, and magnetic resonance imaging (MRI) is a widely used modality in their diagnosis and treatment. However, the availability and quality of public datasets for glioma MRI are not well known. In this review, we searched for public datasets for glioma MRI using Google Dataset Search, The Cancer Imaging Archive (TCIA), and Synapse. A total of 28 datasets published between 2005 and May 2024 were found, containing 62019 images from 5515 patients. We analyzed the characteristics of these datasets, such as the origin, size, format, annotation, and accessibility. Additionally, we examined the distribution of tumor types, grades, and stages among the datasets. The implications of the evolution of the WHO classification of tumors of the brain are discussed, in particular the 2021 update that significantly changed the definition of glioblastoma. Potential research questions that could be explored using these datasets are also highlighted, such as tumor evolution through malignant transformation, MRI normalization, and tumor segmentation. Notably, only two of the 28 datasets studied reflect the current WHO classification. This review provides a comprehensive overview of the publicly available datasets for glioma MRI, helping medical image analysis researchers choose datasets efficiently.
Abstract: Machine learning-based methods for diagnosis and progression prediction of COVID-19 from imaging data have gained significant attention in recent months, in particular through the use of deep learning models. In this context, hundreds of models were proposed, the majority of them trained on public datasets. Data scarcity, mismatch between training and target population, group imbalance, and lack of documentation are important sources of bias, hindering the applicability of these models to real-world clinical practice. Considering that datasets are an essential part of model building and evaluation, a deeper understanding of the current landscape is needed. This paper presents an overview of the currently publicly available COVID-19 chest X-ray datasets. Each dataset is briefly described, and potential strengths, limitations, and interactions between datasets are identified. In particular, some key properties of current datasets that could be potential sources of bias, impairing models trained on them, are pointed out. These descriptions are useful for building models on those datasets: for choosing the best dataset according to the model's goal, for taking the specific limitations into account to avoid reporting overconfident benchmark results, and for discussing their impact on generalisation capabilities in a specific clinical setting.